arxiv: 2605.22765 · v1 · pith:NIZTTQOXnew · submitted 2026-05-21 · 💻 cs.LG · stat.ML

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

Pith reviewed 2026-05-22 06:40 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords discrete diffusion · uniform diffusion · masked diffusion · leave-one-out posterior · absorbing state · denoising · language modeling · reverse dynamics

The pith

Uniform diffusion models are optimized by a leave-one-out posterior rather than the direct denoising posterior, and an absorbing-state version matches masked diffusion performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the usual plug-in parameterization in uniform diffusion models actually trains a leave-one-out predictor that ignores each token's own noisy observation. This creates a mismatch with the standard cross-entropy training objective. By deriving exact conversions between the denoiser, leave-one-out posterior, and score, the authors separate parameterization choices from the training target. They also present an absorbing-state reformulation that keeps the original joint distribution but breaks sampling into simpler masked-diffusion-style steps with carry-over unmasking and remasking. Experiments on language modeling show consistent gains from the leave-one-out approach and performance that matches or exceeds masked diffusion.

Core claim

In uniform diffusion models the standard plug-in bridge is optimized by a leave-one-out posterior that predicts each clean token without using its own noisy version. Exact conversions exist between the denoiser output, this leave-one-out posterior, and the score function. These conversions allow an informed predictor-corrector sampler and improved temperature sampling at inference time with no retraining. An absorbing-state reformulation preserves the original UDM joint law while reducing sampling to masked-diffusion-like operations that have simpler posteriors, natural carry-over unmasking, and a built-in remasking mechanism.

What carries the argument

The leave-one-out posterior, which predicts each clean token from all other noisy observations while ignoring its own.

If this is right

  • Leave-one-out parameterizations improve generation quality for uniform diffusion on language modeling tasks.
  • The absorbing-state construction achieves performance that matches or exceeds masked diffusion models.
  • An informed predictor-corrector sampler and temperature sampling based on the leave-one-out predictor improve inference without any retraining.
  • The conversions between denoiser, leave-one-out posterior, and score disentangle parameterization choices from the training objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mismatch between plug-in parameterization and true denoising posterior may appear in other discrete diffusion settings beyond language.
  • The absorbing reformulation could reduce implementation complexity when porting uniform diffusion code to new data types.
  • Improved sampling from the leave-one-out predictor might generalize to continuous diffusion models that use analogous bridge constructions.

Load-bearing premise

The absorbing-state reformulation keeps exactly the same joint probability law over sequences as the original uniform diffusion process.

What would settle it

Train a standard UDM and a leave-one-out version on the same language dataset, then measure whether the leave-one-out version produces lower perplexity or higher-quality generated text under the same sampling budget.

Figures

Figures reproduced from arXiv: 2605.22765 by Alain Durmus, Dario Shariatian, Eric Moulines, Eric P. Xing, Samson Gourevitch, Umut Simsekli, Yazid Janati.

Figure 1: Comparison between the denoiser and leave-one-out parameterizations.
Figure 2: Top-p sampling Gen-PPL frontier, obtained by sweeping p ∈ [0.8, 1.0]. Top-p sampling is applied to the denoiser, the denoiser converted into the LOO denoiser, and the LOO denoiser. Predictor-corrector: We next evaluate the predictor-corrector sampler (Algorithm 5) described at the end of Section 3.2. As detailed in Appendix E, the LOO parameterization gives access to a corrector step without training an auxiliary mode…
Figure 3: Gen-PPL frontiers. Temperature sampling Gen-PPL frontier, obtained by sweeping the temperature…
Figure 4: Left: Sudoku solve rate as a function of NFE for the best learning rate of each method. Right: AUDM, resampled AUDM, UDM, and MDM Gen-PPL frontier. The vertical line corresponds to the dataset entropy. The Sudoku results in…
Figure 5: Comparison between the leave-one-out posterior $\hat{x}^{\mathrm{loo}}_{0|t}(x_t)$ and the denoising posterior $\hat{x}_{0|t}(x_t)$.
Figure 6: Sensitivity of the leave-one-out prediction to the local observation
Figure 7: Generative frontiers with varying temperature across NFEs for the denoiser and leave-one…
Figure 8: Top-p frontiers across all NFEs for the denoiser and leave-one-out parameterizations
Figure 9: Comparison between predictor-corrector sampling and top-…
Figure 10: Full frontier comparison for AUDM, resampled AUDM, UDM, and MDM across all…
original abstract

Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. We characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor. We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. On language modeling, leave-one-out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at https://github.com/samsongourevitch/rev_udm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in Uniform Diffusion Models (UDM) the standard plug-in bridge parameterization is optimized by a leave-one-out posterior rather than the denoising posterior, identifying a mismatch with the cross-entropy objective. It derives exact conversions between the denoiser, leave-one-out posterior, and score. It further introduces an absorbing-state reformulation that preserves the original UDM joint law while enabling masked-diffusion-like sampling operations with simpler posteriors, carry-over unmasking, and natural remasking. Empirical results on language modeling show consistent improvements from leave-one-out parameterizations and that the absorbing construction matches or surpasses masked diffusion.

Significance. If the derivations hold and the reformulation exactly preserves the joint law, the work clarifies why masked and uniform diffusion differ in practice, attributing gaps more to parameterization and sampling than to marginal choices. The leave-one-out predictor, informed predictor-corrector sampler, and temperature sampling offer training-free inference gains. Code and model release aids reproducibility.

major comments (2)
  1. Abstract, second paragraph: the central claim that the absorbing-state reformulation 'preserves the UDM joint law' while decomposing sampling into masked-diffusion-like operations with carry-over unmasking and remasking lacks an explicit derivation of forward-kernel equivalence or finite-time transition-matrix equality. This equivalence is load-bearing; without a lemma showing identical marginals at every t (or infinitesimal-generator agreement), it is unclear whether the processes remain equivalent or diverge at O(dt) when remasking probability is nonzero in the continuous-time limit.
  2. Derivations of conversions (referenced in abstract): the exact conversions between denoiser, leave-one-out posterior, and score are used to disentangle parameterization from objective. A concrete check is needed that these conversions do not reduce by construction to quantities already fitted by the training objective, to confirm they provide new information rather than tautological reparameterization.
minor comments (2)
  1. Notation for the leave-one-out posterior should be introduced with an explicit equation early in the main text rather than only in the abstract.
  2. The experimental section would benefit from reporting variance across multiple runs or statistical tests for the reported generation improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. We address each major comment in detail below. Where the comments identify opportunities for greater clarity or additional supporting material, we have revised the paper accordingly.

point-by-point responses
  1. Referee: Abstract, second paragraph: the central claim that the absorbing-state reformulation 'preserves the UDM joint law' while decomposing sampling into masked-diffusion-like operations with carry-over unmasking and remasking lacks an explicit derivation of forward-kernel equivalence or finite-time transition-matrix equality. This equivalence is load-bearing; without a lemma showing identical marginals at every t (or infinitesimal-generator agreement), it is unclear whether the processes remain equivalent or diverge at O(dt) when remasking probability is nonzero in the continuous-time limit.

    Authors: We thank the referee for this important observation. Section 4 of the manuscript constructs the absorbing-state reformulation by re-expressing the uniform forward process as a mixture of an absorbing state and a masked diffusion process, with the reverse process using carry-over unmasking and a natural remasking step. The construction is designed so that the joint law over clean and noisy sequences remains identical to the original UDM at every finite time. To make the equivalence fully rigorous in the continuous-time setting, we will add a new lemma (Lemma 4.1) that explicitly equates the infinitesimal generators of the two processes and proves that the finite-time marginal distributions coincide for all t, including when the remasking probability is strictly positive. The proof proceeds by showing that the transition rates match exactly and that the resulting Kolmogorov forward equations yield identical solutions. We agree this lemma strengthens the presentation and removes any ambiguity about O(dt) discrepancies. revision: yes

  2. Referee: Derivations of conversions (referenced in abstract): the exact conversions between denoiser, leave-one-out posterior, and score are used to disentangle parameterization from objective. A concrete check is needed that these conversions do not reduce by construction to quantities already fitted by the training objective, to confirm they provide new information rather than tautological reparameterization.

    Authors: We appreciate the request for an explicit non-tautological check. The model is trained by minimizing cross-entropy against the denoising posterior. The leave-one-out posterior, however, is obtained by analytically removing the contribution of the token’s own noisy observation from the denoiser output (see Equations 3–5). This adjustment is not part of the training loss and therefore yields a distinct predictor. In the revised manuscript we will insert a short remark and a small numerical illustration in Section 3.3: we apply the conversion formulas to a trained denoiser on a validation batch and show that the resulting leave-one-out probabilities differ measurably from the raw denoising probabilities; we further demonstrate that substituting the leave-one-out predictor into the informed predictor-corrector sampler produces the reported generation improvements. These results confirm that the conversions extract usable information beyond what is directly optimized by the training objective. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives explicit conversions between the denoiser, leave-one-out posterior, and score, then introduces an absorbing-state reformulation asserted to preserve the original UDM joint law while enabling masked-diffusion-style operations. These steps are presented as independent technical results with stated characterizations and empirical outcomes on language modeling, rather than tautological reductions to fitted inputs, self-definitions, or load-bearing self-citations. No equations or claims in the abstract reduce a prediction or central result to its own construction by definition; the derivations appear self-contained against external benchmarks and falsifiable via the reported generation improvements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

From the abstract alone, the central claims rest on the existence of exact conversions between denoiser, leave-one-out posterior and score, and on the absorbing reformulation preserving the UDM joint law; no explicit free parameters, standard mathematical axioms, or new invented entities with independent evidence are detailed.

invented entities (1)
  • absorbing-state reformulation · no independent evidence
    purpose: Decompose uniform diffusion into masked-diffusion-like sampling operations while preserving the original joint law
    Introduced to enable simpler denoising posteriors, carry-over unmasking and natural remasking; no external falsifiable evidence provided in abstract

pith-pipeline@v0.9.0 · 5836 in / 1379 out tokens · 48965 ms · 2026-05-22T06:40:50.499383+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 11 internal anchors

  1. [1]

    Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021

  2. [2]

    A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022

    Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022

  3. [3]

    Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design

    Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design, 2024. URL https://arxiv.org/abs/2402.04997

  4. [4]

    One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

    Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling, 2014. URL https://arxiv.org/abs/1312.3005

  5. [5]

    Fast sampling via discrete non-markov diffusion models with predetermined transition time

    Zixiang Chen, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, and Quanquan Gu. Fast sampling via discrete non-markov diffusion models with predetermined transition time. Advances in Neural Information Processing Systems, 37:106870–106905, 2024

  6. [6]

    A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

    Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents, 2018. URL https://arxiv.org/abs/1804.05685

  7. [7]

    The Diffusion Duality, Chapter II: $\Psi$-Samplers

    Justin Deschenaux, Caglar Gulcehre, and Subham Sekhar Sahoo. The diffusion duality, chapter II: Ψ-samplers and efficient curriculum, 2026. URL https://arxiv.org/abs/2602.21185

  8. [8]

    Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

    Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  9. [9]

    Gemini Diffusion: Google DeepMind’s experimental research model. https://blog.google/technology/google-deepmind/gemini-diffusion/, May 2025

    Google DeepMind. Gemini Diffusion: Google DeepMind’s experimental research model. https://blog.google/technology/google-deepmind/gemini-diffusion/, May 2025. Accessed: 2026-05-06

  10. [10]

    Vector quantized diffusion model for text-to-image synthesis, 2022

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis, 2022. URL https://arxiv.org/abs/2111.14822

  11. [11]

    The curious case of neural text degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH

  12. [12]

    Argmax flows: Learning categorical distributions with normalizing flows

    Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows: Learning categorical distributions with normalizing flows. In Third Symposium on Advances in Approximate Bayesian Inference, 2021

  13. [13]

    Analyzing hogwild parallel gaussian gibbs sampling

    Matthew J Johnson, James Saunderson, and Alan Willsky. Analyzing hogwild parallel gaussian gibbs sampling. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/b51a15f38...

  14. [14]

    Mercury: Ultra-fast language models based on diffusion

    Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov. Mercury: Ultra-fast language models based on diffusion,

  15. [15]

    URL https://arxiv.org/abs/2506.17298

  16. [16]

    Think while you generate: Discrete diffusion with planned denoising

    Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, and Rafael Gómez-Bombarelli. Think while you generate: Discrete diffusion with planned denoising. arXiv preprint arXiv:2410.06264, 2024

  17. [17]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, pages 32819–32848. PMLR, 2024

  18. [18]

    Building a large annotated corpus of English: The Penn Treebank

    Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330,

  19. [19]

    URL https://aclanthology.org/J93-2004/

  20. [20]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. URL https://arxiv.org/abs/1609.07843

  21. [21]

    Scaling up masked diffusion models on text

    Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=WNvvwK0tut

  22. [22]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models, 2025. URL https://arxiv.org/abs/2502.09992

  23. [23]

    Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

    Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024

  24. [24]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031

  25. [25]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  26. [26]

    Discrete markov probabilistic models: An improved discrete score-based framework with sharp convergence bounds under minimal assumptions

    Le-Tuyet-Nhi Pham, Dario Shariatian, Antonio Ocello, Giovanni Conforti, and Alain Oliviero Durmus. Discrete markov probabilistic models: An improved discrete score-based framework with sharp convergence bounds under minimal assumptions. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=biJiSMLGOV

  27. [27]

    Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

    Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Generative frontiers: Why evaluation matters for diffusion language models, 2026. URL https://arxiv.org/abs/2604.02718

  28. [28]

    Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

    Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

  29. [29]

    The diffusion duality. arXiv preprint arXiv:2506.10892, 2025

    Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality. arXiv preprint arXiv:2506.10892, 2025

  30. [30]

    Simple guidance mechanisms for discrete diffusion models. arXiv preprint arXiv:2412.10193, 2024

    Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. arXiv preprint arXiv:2412.10193, 2024

  31. [31]

    Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37:103131–103167, 2024

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37:103131–103167, 2024

  32. [32]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP

  33. [33]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  34. [34]

    Score-based continuous-time discrete diffusion models

    Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=BYWWwSY2G5s

  35. [35]

    Scaling behavior of discrete diffusion language models

    Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, and Antonio Orvieto. Scaling behavior of discrete diffusion language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=GDYaNzxt9T

  36. [36]

    Generalized interpolating discrete diffusion, 2025

    Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion, 2025. URL https://arxiv.org/abs/2503.04482

  37. [37]

    Remasking discrete diffusion models with inference-time scaling

    Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling, 2026. URL https://arxiv.org/abs/2503.00307

  38. [38]

    Beyond autoregression: Discrete diffusion for complex reasoning and planning. arXiv preprint arXiv:2410.14157

    Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning, 2025. URL https://arxiv.org/abs/2410.14157

  39. [39]

    Character-level Convolutional Networks for Text Classification

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016. URL https://arxiv.org/abs/1509.01626

  40. [40]

    Informed correctors for discrete diffusion models, 2025

    Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, and Scott Linderman. Informed correctors for discrete diffusion models, 2025. URL https://arxiv.org/abs/2407.21243

  41. [41]

    Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

    Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024

  42. [42]

    A reparameterized discrete diffusion model for text generation, 2024

    Lin Zheng, Jianbo Yuan, Lei Yu, and Lingpeng Kong. A reparameterized discrete diffusion model for text generation, 2024. URL https://arxiv.org/abs/2302.05737.

    Appendix Outline. The appendix is organized as follows. Appendix A proves the leave-one-out optimality result, gives the conversion formulas between the leave-one-out denoiser, denoiser and score, ...

  43. [43]

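    Outside this support, the conditional densities may be defined arbitrarily. Proposition 5. It holds for any $x_t$ such that $p_t(x_t) > 0$ that
    $$p^{\ell}_{0|t}(x_0^{\ell} \mid x_t) = \frac{p_t(x_t^{-\ell})\, q^{\ell}_{t|0}(x_t^{\ell} \mid x_0^{\ell})\, p^{\mathrm{loo},\ell}_{0|t}(x_0^{\ell} \mid x_t^{-\ell})}{p_t(x_t)}. \tag{24}$$
    Conversely, suppose that $q^{\ell}_{t|0}(x_t^{\ell} \mid x_0^{\ell}) > 0$ for any $x_0$ and $x_t$. Then, for any $x_t^{-\ell}$ with $p_t(x_t^{-\ell}) > 0$, $p^{\mathrm{loo},\ell}_{0|t}(x_0^{\ell} \mid x_t^{-\ell}) = p_t(x_t)\, p^{\ell}_{0|t}(x^{\ell}$...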
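Identity (24) can be checked by brute force on a tiny example. The sketch below is an illustration rather than the authors' code: it assumes a toy setting with two positions, a vocabulary of size `K = 3`, a fixed `alpha` standing in for $\alpha_t$, a random clean-data law `p0`, and the per-token uniform kernel $q(x_t \mid x_0) = \alpha_t \mathbb{1}\{x_t = x_0\} + (1-\alpha_t)/K$. It then compares both sides of (24) for the first position.

```python
import numpy as np

K, alpha = 3, 0.6                      # toy vocabulary size and alpha_t (assumed values)
rng = np.random.default_rng(0)

p0 = rng.random((K, K))                # hypothetical clean-data law over two tokens
p0 /= p0.sum()

# Uniform forward kernel, q[x0, xt] = alpha * 1{xt == x0} + (1 - alpha) / K.
q = alpha * np.eye(K) + (1 - alpha) / K

# joint[a, b, c, d] = p0(a, b) * q(c | a) * q(d | b): (a, b) clean, (c, d) noisy.
joint = np.einsum("ab,ac,bd->abcd", p0, q, q)

pt = joint.sum(axis=(0, 1))            # p_t(x_t), indexed [c, d]
denoiser = joint.sum(axis=1) / pt      # p(x_0^1 = a | x_t = (c, d)), indexed [a, c, d]

m = joint.sum(axis=(1, 2))             # p(x_0^1 = a, x_t^2 = d), indexed [a, d]
pt_minus = m.sum(axis=0)               # p_t(x_t^{-1}), indexed [d]
loo = m / pt_minus                     # leave-one-out posterior, indexed [a, d]

# Right-hand side of (24) for the first position.
rhs = pt_minus[None, None, :] * q[:, :, None] * loo[:, None, :] / pt[None, :, :]

identity_holds = np.allclose(denoiser, rhs)
```

Note that `loo` carries no index for the first noisy token, which is exactly the leave-one-out property, whereas `denoiser` does depend on it.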

  44. [44]

    In UDM this condition is satisfied for every $t > 0$, so the two representations can always be converted into one another

    (26) Therefore, the conversion from the denoiser to the leave-one-out posterior is available exactly when the forward has full support. In UDM this condition is satisfied for every $t > 0$, so the two representations can always be converted into one another. In MDM, by contrast, the condition fails for unmasked positions. If $x_t^{\ell} = m$, the likelihood is const...

  45. [45]

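    For MDM, the conversion is explicit only on the support of the forward process

    $= \alpha_t \langle x_t^{\ell}, x_0^{\ell} \rangle + \frac{1-\alpha_t}{K}$, the first identity above yields
    $$p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}\!\left( \frac{(1-\alpha_t)\, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} + K\alpha_t \langle x_t^{\ell}, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} \rangle\, x_t^{\ell}}{1-\alpha_t + K\alpha_t \langle x_t^{\ell}, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} \rangle} \right). \tag{27}$$
    Conversely, the inverse relation can be written as
    $$\hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} = \frac{(1 + (K-1)\alpha_t)\, \hat{x}_{0|t}(x_t)^{\ell} - K\alpha_t \langle x_t^{\ell}, \hat{x}_{0|t}(x_t)^{\ell} \rangle\, x_t^{\ell}}{1 + (K-1)\alpha_t - K\alpha_t \langle x_t^{\ell}, \hat{x}_{0|t}(x_t)^{\ell} \rangle}. \tag{28}$$
    Where $\hat{x}_{0|t}(x_t)$...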
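The two conversions (27) and (28) are simple enough to write out directly. The following is a minimal NumPy sketch, not the released implementation: the function names `loo_to_denoiser` and `denoiser_to_loo`, the array layout (one row of `K` probabilities per position), and the use of integer token ids for `xt` are all assumptions made for illustration. It requires $\alpha_t < 1$ so that the denominators stay positive.

```python
import numpy as np

def loo_to_denoiser(loo, xt, alpha, K):
    """Eq. (27): map leave-one-out probabilities (L, K) to denoiser probabilities."""
    onehot = np.eye(K)[xt]                              # one-hot x_t^l, shape (L, K)
    inner = (onehot * loo).sum(-1, keepdims=True)       # <x_t^l, loo^l>
    num = (1 - alpha) * loo + K * alpha * inner * onehot
    den = (1 - alpha) + K * alpha * inner
    return num / den

def denoiser_to_loo(denoiser, xt, alpha, K):
    """Eq. (28): recover leave-one-out probabilities from denoiser probabilities."""
    onehot = np.eye(K)[xt]
    inner = (onehot * denoiser).sum(-1, keepdims=True)  # <x_t^l, denoiser^l>
    c = 1 + (K - 1) * alpha
    num = c * denoiser - K * alpha * inner * onehot
    den = c - K * alpha * inner
    return num / den
```

Applying one function after the other returns the input, and the denoiser always puts at least as much mass on the observed token $x_t^{\ell}$ as the leave-one-out predictor does, which is the sense in which the latter ignores its own noisy observation.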

  46. [46]

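    If instead $x_t^{\ell} \neq m$, then $q^{\ell}_{t|0}(x_t^{\ell} \mid x^{\ell}$…

    $= 1-\alpha_t$ for every $x_0^{\ell} \in V$, so $p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}(\hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})$. If instead $x_t^{\ell} \neq m$, then $q^{\ell}_{t|0}(x_t^{\ell} \mid x^{\ell}$…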

  47. [47]

    Therefore, on unmasked positions, the denoiser no longer contains enough information to reconstruct the leave-one-out posterior, so the inverse formula is not available

    $= \alpha_t \mathbb{1}\{x_0^{\ell} = x_t^{\ell}\}$, hence $p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}(x_t^{\ell})$. Therefore, on unmasked positions, the denoiser no longer contains enough information to reconstruct the leave-one-out posterior, so the inverse formula is not available. Remark 2. When these conversion formulas hold in both directions, as they do for UDM, the uniqueness of the denoiser as a minimizer of...

  48. [48]

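    $p^{\ell}_{0|t}(x_0^{\ell} \mid x_t)$. (29) This already shows that the score can be parameterized from the denoiser. The same quantity can also be expressed directly in terms of the leave-one-out denoiser:
    $$\langle y^{\ell}, s_t(x_t)^{\ell} \rangle = \frac{q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})}{q^{\ell}_{t|0}(x_t^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})}, \tag{30}$$
    for every $y \in X$ such that $y^{-\ell} = x_t^{-\ell}$. Proof. By definition of the score, for every $y \in X$ such that $y^{-\ell} =$...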

  49. [49]

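    Therefore, $p_t(y) = p_t(x_t^{-\ell}) \sum_{x_0^{\ell} \in V} q^{\ell}_{t|0}(y^{\ell} \mid x_0^{\ell})\, p^{\mathrm{loo},\ell}_{0|t}(x_0^{\ell} \mid x_t^{-\ell})$

    $= p_t(x_t^{-\ell})\, p^{\mathrm{loo},\ell}_{0|t}(x_0^{\ell} \mid x_t^{-\ell})$. Therefore,
    $$p_t(y) = p_t(x_t^{-\ell}) \sum_{x_0^{\ell} \in V} q^{\ell}_{t|0}(y^{\ell} \mid x_0^{\ell})\, p^{\mathrm{loo},\ell}_{0|t}(x_0^{\ell} \mid x_t^{-\ell}).$$
    Since the map $\nu \mapsto q^{\ell}_{t|0}(y^{\ell} \mid \nu)$ is affine in its first argument, this becomes $p_t(y) = p_t(x_t^{-\ell})\, q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})$. Taking $y = x_t$ yields in the same way $p_t(x_t) = p_t(x_t^{-\ell})\, q^{\ell}_{t|0}(x_t^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})$. Dividing the two equations gives $\langle$...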
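The ratio obtained by dividing the two expressions can be evaluated numerically. The sketch below is illustrative only: `score_from_loo` is a hypothetical helper name, the affine extension of the uniform kernel is taken to be $q(y \mid \nu) = \alpha_t \nu_y + (1-\alpha_t)/K$, and the two-token law `p0` is a made-up toy used to compare the formula against a brute-force ratio of marginals.

```python
import numpy as np

def score_from_loo(loo, xt, alpha, K):
    """s[l, y] = q(y | loo^l) / q(x_t^l | loo^l), using the affine kernel extension."""
    lik = alpha * loo + (1 - alpha) / K               # q(y | loo^l) for all y, shape (L, K)
    ref = lik[np.arange(len(xt)), xt][:, None]        # q(x_t^l | loo^l)
    return lik / ref

# Toy check on two positions (all values below are assumptions).
K, alpha = 3, 0.6
rng = np.random.default_rng(1)
p0 = rng.random((K, K))
p0 /= p0.sum()
q = alpha * np.eye(K) + (1 - alpha) / K               # q[x0, xt]
pt = np.einsum("ab,ac,bd->cd", p0, q, q)              # marginal p_t(x_t^1, x_t^2)
xt = np.array([2, 0])

# Leave-one-out posteriors: each position conditions only on the other noisy token.
w0 = np.einsum("ab,bd->ad", p0, q)[:, xt[1]]          # proportional to p(x_0^1 | x_t^2)
w1 = np.einsum("ab,ac->bc", p0, q)[:, xt[0]]          # proportional to p(x_0^2 | x_t^1)
loo = np.stack([w0 / w0.sum(), w1 / w1.sum()])

s = score_from_loo(loo, xt, alpha, K)

# Brute-force ratios p_t(y) / p_t(x_t) for y differing from x_t at one position.
brute = np.stack([pt[:, xt[1]], pt[xt[0], :]]) / pt[xt[0], xt[1]]
```

In this toy setting `s` and `brute` agree up to floating-point error, and the entry at the observed token is 1 by construction.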

  50. [50]

    also considers a LOO denoiser parameterization which follows from Proposition 6. Therefore, if $\hat{x}^{\theta}_0(x_t, t)$ is a model for the leave-one-out denoiser, one may use the parameterization
    $$p^{\theta}(x_t, t) = \alpha_t\, \hat{x}^{\theta}_0(x_t, t) + (1-\alpha_t)\, \pi^{\ell}. \tag{35}$$
    The same restriction is still needed after this reparameterization. If $\hat{x}^{\theta}_0(x_t, t)^{\ell}$ may depend on $x_t^{\ell}$, the minimizer of (34) ma...

  51. [51]

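    Therefore $p_t(y) = p_t(x_t^{-\ell}) \sum_{x_0^{\ell} \in V} q^{\ell}_{t|0}(y^{\ell} \mid x_0^{\ell})\, p^{\mathrm{loo},\ell}_{0|t}(x_0^{\ell} \mid x_t^{-\ell})$

    $= p_t(x_t^{-\ell})\, p^{\mathrm{loo},\ell}_{0|t}(x_0^{\ell} \mid x_t^{-\ell})$. Therefore
    $$p_t(y) = p_t(x_t^{-\ell}) \sum_{x_0^{\ell} \in V} q^{\ell}_{t|0}(y^{\ell} \mid x_0^{\ell})\, p^{\mathrm{loo},\ell}_{0|t}(x_0^{\ell} \mid x_t^{-\ell}).$$
    [Figure 5: Comparison between the leave-one-out posterior $\hat{x}^{\mathrm{loo}}_{0|t}(x_t)$ and the denoising posterior $\hat{x}_{0|t}(x_t)$.]
    Using the affine extension of the forward kernel in its first argument, this becomes $p_t(y) = p_t(x_t^{-\ell})\, q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)$...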

  52. [52]

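    … $\mathrm{Cat}(x^{\ell}_{t_{i+1}}; \mathbf{1}/K) + \mathbb{1}_{\tau^{\ell} > t_{i+1}} \sum_{x^{\ell}_{t_{i+1}}} \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}_{t_{i+1}})\, \mathrm{Cat}(x^{\ell}_{t_{i+1}}; x^{\ell}_0) = \mathbb{1}_{\tau^{\ell} \le t_i}\, \mathrm{Cat}(x^{\ell}_{t_i}; \mathbf{1}/K) + \mathbb{1}_{t_i < \tau^{\ell} \le t_{i+1}}\, \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}$…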

  53. [53]

    This proves the induction step

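    $+\, \mathbb{1}_{\tau^{\ell} > t_{i+1}}\, \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}_0) = \mathrm{Cat}\big(x^{\ell}_{t_i};\ \mathbb{1}_{\tau^{\ell} > t_i}\, x^{\ell}_0 + \mathbb{1}_{\tau^{\ell} \le t_i}\, \mathbf{1}/K\big)$. This proves the induction step. Hence (50) holds for every time of the grid, and therefore for every $t$.
    Lemma 3. If $\tilde{x}_t(\tau)^{\ell} := x_t^{\ell}$ if $\tau^{\ell} > t$, and $m$ if $\tau^{\ell} \le t$, then $p_{0|t}(x_0 \mid x_t, \tau) = p^{\mathrm{mask}}_{0|t}(x_0 \mid \tilde{x}_t(\tau))$.
    Proof. By Bayes' rule and (50), $p_{0|t}(x_0 \mid x_t, \tau) = p_0(x_0)\, q_{t|0}(x_t \mid x_0, \tau) \,/\, \sum_{\tilde{x}_0} p_0(\tilde{x}_0)\, q_{t|0}(x_t \mid \tilde{x}$...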
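The lifted kernel and the masked view of Lemma 3 can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's sampler: the per-position times $\tau^{\ell}$ are drawn so that $P(\tau^{\ell} > t) = \alpha_t$, which is the condition needed for the marginal over $\tau$ to equal the uniform kernel; the linear schedule $\alpha_t = 1 - t$ (hence $\tau^{\ell}$ uniform on $[0, 1)$), the `MASK` id, and both function names are hypothetical choices.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask token m

def sample_lifted_forward(x0, t, K, rng):
    """Draw (x_t, tau): token l is kept if tau^l > t, else resampled uniformly.

    Assumes alpha_t = 1 - t, so tau^l ~ Uniform[0, 1) gives P(tau^l > t) = alpha_t.
    """
    tau = rng.random(x0.shape)
    noise = rng.integers(0, K, size=x0.shape)
    xt = np.where(tau > t, x0, noise)
    return xt, tau

def masked_view(xt, tau, t):
    """x~_t(tau) from Lemma 3: keep x_t^l where tau^l > t, mask everything else."""
    return np.where(tau > t, xt, MASK)
```

Positions with $\tau^{\ell} \le t$ carry no information about $x_0$, which is why replacing them by the mask token leaves the posterior unchanged and a masked-diffusion denoiser can be applied to `masked_view(xt, tau, t)`.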

  54. [54]

    canonical

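    The law $j_{s|0}$ exactly reweights these two possibilities. The lifted reverse transition is then
    $$\bar{p}_{s|t}(x_s, \tau_s \mid x_t, \tau_t) := \sum_{x_0} j_{s|0}(\tau_s \mid x_0, x_s)\, q_{s|0,t}(x_s \mid x_0, x_t)\, p_{0|t}(x_0 \mid x_t, \tau_t). \tag{53}$$
    For a grid $0 = t_0 < \cdots < t_n = 1$, let $\bar{p}_{0:n}$ denote the corresponding path law, with initialization $\bar{p}_{t_n}(x_{t_n}, \tau_{t_n}) := p_{t_n}(x_{t_n})\, j_{t_n}(\tau_{t_n} \mid x_{t_n})$. If $\alpha_{t_n} = 0$, this reduces to $X_{t_n} \sim \upsilon^{\otimes L}$ a...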

  55. [55]

    and later used by [38]. The idea is to combine two autoregressive streams, one left-to-right and one right-to-left, and to offset the representations so that the output at position ℓ never attends to the input token at the same position, while still depending on all the other positions. Equivalently, in the continuous relaxation used to describe the archite...
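The visibility pattern described here can be illustrated with boolean attention masks. The sketch below only shows which inputs each output position may see in a single attention layer; it is an assumption-laden illustration (the helper name `two_stream_masks` is hypothetical) and does not reproduce the offsetting of representations that a multi-layer architecture needs to avoid leaking a token's own input.

```python
import numpy as np

def two_stream_masks(L):
    """Boolean masks of shape (L, L); entry [l, j] is True if output l may see input j."""
    idx = np.arange(L)
    left_to_right = idx[None, :] < idx[:, None]   # strictly earlier positions, j < l
    right_to_left = idx[None, :] > idx[:, None]   # strictly later positions, j > l
    return left_to_right, right_to_left

l2r, r2l = two_stream_masks(5)
combined = l2r | r2l                              # every position except l itself
```

Combining the two streams gives each output access to all positions but its own, which is the dependence structure a leave-one-out predictor requires.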

  56. [56]

    contract

    at fixed checkpoint, and plotting generative perplexity against the resulting entropy. For top-p frontiers we use p ∈ {0.80, 0.85, 0.90, 0.92, 0.94, 0.96, 0.98, 1.00} and NFE ∈ {8, 16, 32, 64, 128, 256, 512, 1024}. For temperature frontiers we use T ∈ {0.80, 0.82, ..., 1.10} over the same NFE grid. The predictor-corrector experiments are run on OWT with the confidence-based ...
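For readers who want to reproduce a frontier sweep, the grids above and the two per-token transformations can be written down as follows. This is a generic sketch of nucleus (top-p) filtering and temperature scaling on a single categorical distribution, with hypothetical function names; how the paper applies them to the denoiser or the leave-one-out predictor inside the sampler is not specified in this excerpt.

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p, then renormalise."""
    order = np.argsort(-probs, kind="stable")
    sorted_probs = probs[order]
    mass_before = np.cumsum(sorted_probs) - sorted_probs
    keep = mass_before < p                        # a token is kept if the mass before it is < p
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()

def apply_temperature(probs, T):
    """Rescale log-probabilities by 1/T and renormalise (expects strictly positive probs)."""
    logits = np.log(probs) / T
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

# Sweep grids quoted in the text.
TOP_P_GRID = [0.80, 0.85, 0.90, 0.92, 0.94, 0.96, 0.98, 1.00]
NFE_GRID = [8, 16, 32, 64, 128, 256, 512, 1024]
TEMPERATURE_GRID = [round(0.80 + 0.02 * i, 2) for i in range(16)]  # 0.80, 0.82, ..., 1.10
```

Each point of a frontier would then correspond to one (p or T, NFE) pair, with generative perplexity plotted against sample entropy.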