Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space

Benjamin J. Zhang; Kelvin Kan; Markos A. Katsoulakis; Stanley Osher; Tuhin Sahai; Xingjian Li

REVIEW 3 major objections 4 minor 1 cited by

Vocabulary size drops out of discrete diffusion convergence bounds

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · deepseek-v4-flash

2026-08-02 13:46 UTC pith:SQ62TMIU

Load-bearing objection: A serious theory paper with a real S-free IPM bound for discrete diffusion; the central advertised interpretation, however, rests on a pointwise factor-2 score condition that the paper states but does not secure.

arXiv 2605.17232 v4 · pith:SQ62TMIU · submitted 2026-05-17 · cs.LG, math.ST, stat.ML, stat.TH

Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

classification cs.LG, math.ST, stat.ML, stat.TH · MSC 60J27, 68T07

keywords discrete diffusion, adjoint equation, integral probability metric, dimension-free bounds, masked diffusion, uniform rate, score entropy, coupling

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to prove dimension-free convergence guarantees for discrete diffusion models: the error between the true data distribution and the generated distribution, measured in any integral probability metric (IPM), is bounded by constants independent of the vocabulary size S. It establishes this for both masked and uniform transition rates, including the singular masked prior where KL-based analyses diverge. The derivation works in the space of observables via the Kolmogorov backward equation, which factors the IPM bound into an initialization error plus a training-error term controlled by score-matching or score-entropy losses. If correct, training on standard score objectives directly minimizes upper bounds for the whole family of IPMs, and step-complexity guarantees extend to any IPM.

Core claim

Working from the Kolmogorov backward equation rather than from probability measures directly, the paper proves Theorem 5.1: under a single integrability assumption on the rate matrix, the IPM between data and generated distributions is at most 2||ψ||_∞ times (d e^{-||β||_1} + sqrt(2d) sqrt(L_SE + (2/3) L_3)) for masked and uniform rates, with no factor involving the state space size S. The masked case uses score–marginal cancellation (p_t(x) s_t(x)_y = p_t(y)) and exit-rate routing; the uniform case uses a synchronous coupling argument to control the initialization error without the log S factor that log-Sobolev arguments produce. Corollaries specialize the bound to TV, Wasserstein, MMD and other IPM

What carries the argument

The central object is the adjoint (Kolmogorov backward) equation dφ_t/dt = \tilde Q^←_t φ_t with terminal condition φ_0 = ψ, whose solution propagates an observable test function through the approximate reverse process. Testing the error equation against φ turns expectation differences into a time-integral of score mismatches (E[(φ(y)−φ(x))Q(y,x)(s−\tilde s)]), which Cauchy–Schwarz splits into a loss term (L_SE or L_WSM) times an activity factor. The score–marginal identity p_t(x) s_t(x)_y = p_t(y) lets the masked case re-index sums from predecessors to successors, whose count is ≤ d, removing S; a synchronous coupling of uniform chains with shared Poisson clocks and shared uniform samples removes S from the ini
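
The duality behind this step can be checked on a toy chain. The following is a minimal sketch, not the paper's construction: it assumes a hypothetical 3-state time-homogeneous generator `Q`, explicit Euler steps, and the convention that the law evolves forward as p' = pQ while the observable is pulled back from the terminal time. Under those assumptions the pairing of the final law with ψ equals the pairing of the initial law with the propagated observable, and the propagated observable never exceeds ψ in sup-norm.

```python
# Toy check of forward/backward duality for a CTMC (illustrative only).
# Assumptions: hypothetical generator Q, Euler step h, and the convention
# p' = pQ for the law and phi pulled back from the terminal time.

def mat_vec(A, v):
    """Matrix times column vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def vec_mat(v, A):
    """Row vector times matrix."""
    return [sum(v[i] * A[i][j] for i in range(len(v))) for j in range(len(A[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Rows sum to zero, off-diagonal entries are nonnegative.
Q = [[-1.0, 0.7, 0.3],
     [0.2, -0.5, 0.3],
     [0.4, 0.6, -1.0]]
h, n_steps = 0.01, 200
# One Euler step I + hQ is a stochastic matrix for this small h.
step = [[(1.0 if i == j else 0.0) + h * Q[i][j] for j in range(3)] for i in range(3)]

p0 = [1.0, 0.0, 0.0]        # initial law
psi = [0.5, -0.5, 0.25]     # test observable

p = p0
for _ in range(n_steps):     # push the law forward
    p = vec_mat(p, step)

phi = psi
for _ in range(n_steps):     # pull the observable backward
    phi = mat_vec(step, phi)

lhs = dot(p, psi)            # expectation of psi under the final law
rhs = dot(p0, phi)           # initial law paired with the propagated observable
print(lhs, rhs, max(abs(x) for x in phi))
```

Both pairings compute the same product of the initial law, the step matrix raised to `n_steps`, and ψ, so they agree up to floating-point rounding. The sup-norm contraction holds here because each step is a stochastic matrix.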

Load-bearing premise

The S-free masked bound requires the learned score at every reachable state to stay within a factor of two of the true score ratio; if any learned score falls outside [s/2, 3s/2], the cubic-correction inequality that removes the vocabulary-size dependence no longer holds.

What would settle it

Construct a masked diffusion on a small state space (e.g., S=3, d=2) with a deliberately mis-specified score that violates the factor-2 window at one state, and check whether the claimed S-free bound is violated while the general Case 3 bound still holds. More directly, compute L_SE and the exact IPM error for a sequence of vocabularies S → ∞ with fixed d; if the error grows with S even at fixed training loss, the S-independence claim is falsified.

If this is right

If the bounds hold, score-matching (L_WSM) and score-entropy (L_SE) training directly minimize upper bounds for TV, W1, MMD, and every IPM simultaneously.
Convergence guarantees for masked diffusion no longer diverge at the singular all-mask prior; KL-based divergence is bypassed.
Step complexity for uniformization becomes S-free: O(d β_max/β_min (log 1/β_max + log(d C_Ψ β_min/ε))) for masked and O(d (β_max log β_min + log(d C_Ψ/ε))) for uniform, so the required training loss L_SE = O(ε²/(d C_Ψ²)) does not tighten as vocabulary grows.
Early-stopping error is additive and S-free: d ∫_0^δ β(t) dt + d e^{-||β||_1} plus a δ-truncated training term.
The framework applies to general rate matrices under a single integrability assumption, covering time-inhomogeneous rates.
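
As a rough illustration of the masked/uniform bound quoted under "Core claim", the sketch below evaluates 2||ψ||_∞ (d e^{-||β||_1} + sqrt(2d) sqrt(L_SE + (2/3) L_3)). The function name `ipm_bound` and its parameter names are hypothetical, chosen for this example only. Note that the vocabulary size S does not appear among the arguments.

```python
import math

def ipm_bound(d, beta_l1, loss_se, loss_3, psi_sup=1.0):
    """Evaluate the Theorem 5.1 bound as quoted in this review.

    d        : sequence length
    beta_l1  : integral of the rate schedule beta over [0, T]
    loss_se  : score-entropy loss L_SE
    loss_3   : cubic correction L_3
    psi_sup  : sup-norm of the test function (hypothetical default of 1)
    """
    init_error = d * math.exp(-beta_l1)
    train_error = math.sqrt(2 * d) * math.sqrt(loss_se + (2.0 / 3.0) * loss_3)
    return 2.0 * psi_sup * (init_error + train_error)

# Example values are arbitrary and only show how the two terms trade off.
print(ipm_bound(d=128, beta_l1=20.0, loss_se=1e-6, loss_3=0.0))
```

The initialization term decays exponentially in the integrated rate, while the training term scales like the square root of d times the loss, so in this sketch the training term usually dominates once ||β||_1 is moderately large.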

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The factor-2 accuracy window on the learned score (s̃ ∈ [s/2, 3s/2]) is implicitly required for the S-free SE bound; if a learned score is badly calibrated on rare tokens, the L3 cubic-correction inequality can fail and the bound no longer follows, though the WSM-based uniform bound without that window remains.
The S-independence leans on the Kronecker-factorization of the forward kernel over tokens; for non-factorizing rates (e.g., dependencies across positions) the successor-set count may exceed d and an S-dependence could reappear.
The coupling proof for uniform transitions is essentially the classical synchronous coupling of lazy random walks on the hypercube; this suggests the same technique could give S-free mixing bounds for other reversible spin systems used as priors.
One could test empirically whether L_SE-based training yields errors that are roughly independent of vocabulary size S for fixed sequence length d, extrapolating the theoretical prediction to finite-sample settings.
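
The synchronous-coupling idea mentioned above can be simulated directly. This is a minimal sketch under stated assumptions, not the paper's proof: each of the d coordinates carries an independent Poisson clock of constant rate `beta`, and at every ring both chains overwrite that coordinate with the same uniform draw from {0, ..., S-1}. The function name `coupled_mismatch_rate` is hypothetical. Under these assumptions a coordinate disagrees at time T only if its clock never rang, so the mismatch probability is 1 - (1 - e^{-βT})^d, which the union bound caps at d e^{-βT} regardless of S.

```python
import math
import random

def coupled_mismatch_rate(d, S, beta, T, trials, seed=0):
    """Fraction of trials in which two synchronously coupled chains differ at time T."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        x = [0] * d              # chain 1 starts at all zeros
        y = [1] * d              # chain 2 differs in every coordinate (needs S >= 2)
        for i in range(d):
            t = rng.expovariate(beta)      # first ring of the shared clock
            while t < T:
                u = rng.randrange(S)       # shared uniform sample
                x[i] = u
                y[i] = u
                t += rng.expovariate(beta)
        if x != y:
            mismatches += 1
    return mismatches / trials

d, beta, T = 4, 1.0, 3.0
exact = 1.0 - (1.0 - math.exp(-beta * T)) ** d
union_bound = d * math.exp(-beta * T)
for S in (3, 1000):
    print(S, coupled_mismatch_rate(d, S, beta, T, trials=20000), exact, union_bound)
```

In this toy setup the estimated mismatch rate should sit near the exact value for both vocabulary sizes, which is the S-independence the coupling argument is meant to deliver.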

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Read this one. It is a serious theory paper, probably the best current shot at S-free convergence bounds for masked and uniform discrete diffusion. The core Theorem 5.1 is a genuine advance: it uses adjoint equations to bound any IPM by score losses, and the removal of the vocabulary-size factor S is real, not an artifact. The masked-case proof via score-marginal cancellation and exit-rate routing is detailed and internally consistent; I did not find a load-bearing error in the main chain. The unified IPM coverage in Corollary 5.3 and the step-complexity corollaries are useful. No experiments, but that is not a flaw for this genre.

Where I part company with the abstract's confidence: the headline sentence that training directly minimizes an upper bound on all IPMs is only as good as the factor-2 condition. The L_SE bounds in Theorem 5.1 assume \tilde{s}_t(x)_y is within a factor of two of s_t(x)_y on the support of p_tQ_t. Lemma B.12 uses that pointwise condition to bound L_WRSM by 2L_SE + (4/3)L_3. The stress-test note is right: without that condition, the cubic term L_3 can be arbitrarily large while L_SE is arbitrarily small — take one transition with \tilde{s}=Ks and K large. So the theorem is internally valid as an inequality involving L_3, but it does not, by itself, justify the "directly minimizes" claim for masked diffusion. The assumption is stated, so the theorem is not wrong; the interpretation is not secured. The revision should either prove the factor-2 condition under a concrete training procedure or present the bound as a convergence-in-discrepancy result when score error goes to zero, not as a pure training-loss bound.
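
The single-transition example can be made concrete with a short numerical check. The sketch assumes per-transition integrands that the review does not spell out: the score-entropy term in its Bregman form, \tilde s − s + s log(s/\tilde s), and a cubic term |s − \tilde s|^3 / s^2. It also takes s = 1 and the weight K^{-2} used in the referee report. The function names are hypothetical.

```python
import math

def se_term(s, s_hat):
    """Assumed score-entropy integrand (Bregman form)."""
    return s_hat - s + s * math.log(s / s_hat)

def cubic_term(s, s_hat):
    """Assumed cubic-correction integrand."""
    return abs(s - s_hat) ** 3 / s ** 2

def weighted_losses(K, s=1.0):
    """Losses from one transition with weight K**-2 and estimator s_hat = K * s."""
    w = K ** -2.0
    return w * se_term(s, K * s), w * cubic_term(s, K * s)

for K in (10.0, 100.0, 1000.0):
    se, l3 = weighted_losses(K)
    print(K, se, l3, l3 / se)
```

Under these assumptions the weighted score-entropy term shrinks roughly like 1/K while the weighted cubic term grows roughly like K, so their ratio increases without bound. At the edges of the factor-2 window (K = 0.5 or K = 1.5) the same ratio stays below 3 in this toy setup.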

Minor but real: the mask-probability ODE repeats an initial-condition typo — \alpha_0=1 with solution 1-e^{-\int\beta} is inconsistent (the solution gives \alpha_0=0). It appears in Lemmas B.14, E.1, E.2, and F.2. Also, nonnegativity of the learned score is implicit and should be stated. The "first S-free" claims lean on "to the best of our knowledge," but the Table 1 comparisons look honest.

Bottom line: the central technical result is likely correct and is a solid contribution. It needs a revision to align claims with assumptions, not a rewrite. I would cite it, and I would send it to a serious referee. The proof techniques are worth a reading-group discussion, though the paper is long; the main value is the theorem itself.

Referee Report

3 major / 4 minor

Summary. This paper develops an adjoint-equation (Kolmogorov backward equation) framework for convergence analysis of discrete diffusion models on sequence spaces [S]^d. The main result, Theorem 5.1, bounds the IPM between the data distribution and the approximate reverse-process marginal by an initialization error plus a training-error term. For masked and uniform rate matrices, specialized bounds are claimed to be independent of the vocabulary size S: the masked case uses the score-entropy (SE) loss plus a cubic correction L3, while the uniform case admits both SE-based and weighted-score-matching (WSM) bounds. Corollaries specialize these to TV and general IPMs, add early-stopping errors, and derive expected step complexities for uniformization. The paper also claims that, for the first time, bounds are entirely free of S and apply to singular priors such as the masked distribution, under a single rate-matrix regularity assumption.

Significance. If the main results are valid, this is a substantial contribution. S-free convergence bounds are highly relevant for language-scale discrete diffusion, where S > 10^5 makes existing TV/KL bounds vacuous. The unified IPM treatment is elegant, and the proof structure is detailed: the adjoint calculation, score-marginal cancellation, exit-rate routing, and the coupling argument for the uniform initialization error are all nontrivial and mostly written out. The bounds have explicit constants and no fitted parameters. However, the advertised interpretation that 'training directly and explicitly minimizes an upper bound on the entire family of IPMs' is not fully supported for the masked SE bound as written, because the bound contains a cubic term L3 that is not a training loss and is not shown in the paper to be controlled by L_SE. This is a load-bearing gap for one of the central claims, although it appears fixable with a short additional argument.

major comments (3)

[Theorem 5.1, Case 1; Lemma B.12; Section 5.1] The masked SE bound (19) contains the cubic correction L3, which is not a training loss. The highlighted conclusion that 'model training directly and explicitly minimizes an upper bound on the entire family of IPMs' is therefore not justified as written: nothing in the displayed inequality shows that small L_SE forces the bound small, since L3 is independent of L_SE in the displayed formula. This is not a mere technicality: outside the factor-2 window, L3 can dwarf L_SE. For a single transition with weight ε=K^{-2} and estimator \tilde s = K s, the SE loss contribution scales as K^{-1} while the L3 contribution scales as K, so L3/L_SE grows unboundedly. The theorem is salvageable: under the stated assumption \tilde s ∈ [s/2, 3s/2], one has |s-\tilde s|^3/s^3 ≤ 1/2 |s-\tilde s|^2/s^2, hence L3 ≤ 1/2 L_WRSM ≤ L_SE + (2/3)L3, giving L3 ≤ 3L_SE. This closing argument is absent and should be
[Abstract; Section 1; Theorem 5.1 preamble] The claim that the theory 'relies only on a single standard rate-matrix regularity assumption' is contradicted by Theorem 5.1 itself. All SE-based bounds require the additional pointwise condition \tilde s_t(x)_y ∈ [s_t(x)_y/2, 3s_t(x)_y/2] on the support of p_t Q_t, and the definition of L_SE implicitly requires s_t(x)_y > 0 on that support. The WSM-based bounds avoid the factor-2 condition, but the S-free masked bound is only proved for SE. The abstract and contribution statements should be qualified: the S-free masked result depends on an accuracy assumption on the learned score, not solely on integrability of the rate matrix.
[Corollary 5.5 and its proof (Appendix F)] The step-complexity corollary uses the SE-based bound of Corollary 5.4 and asserts that L^δ_SE = O(ε^2/(d C_Ψ^2)) suffices for ε accuracy. But the bound also contains L^δ_3; without a proof that L^δ_3 = O(L^δ_SE) under the stated assumptions, the conclusion is incomplete. Moreover, the assumption '\tilde s_t(x)_y ≍ s_t(x)_y' is weaker/ambiguous relative to the factor-2 window required by Theorem 5.1; ≍ usually allows arbitrary constants, not specifically [1/2, 3/2]. Quantify the comparability constant and add the L3-control argument (as in the first major comment) before the complexity claim can stand.
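
The closing argument requested in the first and third major comments can be written out as follows. This is a sketch that assumes L_3 and L_WRSM are integrals of the pointwise quantities against the same nonnegative weight, so that a pointwise inequality integrates, and that L_3 is finite.

```latex
% Assumption: \tilde s \in [s/2,\, 3s/2] on the support of p_t Q_t,
% i.e. |s - \tilde s| \le s/2 pointwise.
\begin{align*}
\frac{|s-\tilde s|^3}{s^3} &\le \frac12\,\frac{|s-\tilde s|^2}{s^2}
  &&\Longrightarrow\quad L_3 \le \tfrac12\, L_{\mathrm{WRSM}},\\
L_{\mathrm{WRSM}} &\le 2L_{\mathrm{SE}} + \tfrac43 L_3
  &&\text{(Lemma B.12)},\\
L_3 &\le L_{\mathrm{SE}} + \tfrac23 L_3
  &&\Longrightarrow\quad L_3 \le 3L_{\mathrm{SE}},\\
\sqrt{2d}\,\sqrt{L_{\mathrm{SE}} + \tfrac23 L_3}
  &\le \sqrt{2d}\,\sqrt{3L_{\mathrm{SE}}} = \sqrt{6d\,L_{\mathrm{SE}}}.
\end{align*}
```

With this step in place, the training term of the masked bound would depend on L_SE alone, which is what the step-complexity requirement L_SE = O(ε²/(d C_Ψ²)) needs.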

minor comments (4)

[Lemma B.14; Appendix E.1; Appendix E.2] The initial condition for α_t is inconsistent. The text states '∂_t α_t = β(t)(1-α_t), α_0=1' with solution 'α_t = 1 - e^{-∫β}'. If α_0=1, the solution is identically 1; the written solution corresponds to α_0=0, which is also the value needed for p_t(m)=α_t^d and for the final bound TV(p_T,δ_m) ≤ d e^{-||β||_1}. This typo appears at least twice and should be corrected.
[Theorem 5.1; Eq. (9)] The score-entropy loss is not defined when s_t(x)_y = 0, since it involves log(s/\tilde s). The theorem should explicitly state that the SE bounds assume s_t(x)_y > 0 on the support of p_t Q_t, or define the loss by a limiting convention for zero scores.
[Corollary 5.5] The statement uses β_min without requiring β_min > 0. If β(t) can vanish on an interval, the displayed choices of δ ~ ε/(d C_Ψ β_min) are not well-defined. Add a nondegeneracy assumption or a modified argument.
[Table 2] For W_1 with Hamming distance, the constant C_Ψ = d/2 implicitly assumes ψ is centered so that its sup-norm is controlled by half the diameter. This normalization should be stated explicitly, since the IPM is invariant to additive constants but Theorem 5.1 uses ||ψ||_∞.
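
The initial-condition issue in the first minor comment can be verified in a few lines. The sketch assumes β ≥ 0 and writes ||β||_1 for the integral of β over [0, T].

```latex
\begin{align*}
\partial_t(1-\alpha_t) &= -\beta(t)\,(1-\alpha_t)
  \;\Longrightarrow\;
  1-\alpha_t = (1-\alpha_0)\,\exp\!\Big(-\int_0^t \beta(r)\,dr\Big),\\
\alpha_0 = 1 &\;\Longrightarrow\; \alpha_t \equiv 1,
\qquad
\alpha_0 = 0 \;\Longrightarrow\; \alpha_t = 1 - e^{-\int_0^t \beta(r)\,dr},\\
1-\alpha_T^{\,d} &= 1-\big(1-e^{-\|\beta\|_1}\big)^d \;\le\; d\,e^{-\|\beta\|_1}
  \qquad\text{(Bernoulli's inequality)}.
\end{align*}
```

Only the choice α_0 = 0 reproduces the stated solution and the initialization bound d e^{-||β||_1}.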

Circularity Check

0 steps flagged

No significant circularity: the IPM bounds are derived from independently defined training losses via self-contained appendix proofs; self-citations are methodological antecedents, not load-bearing inputs.

full rationale

The central Theorem 5.1 bounds the IPM by training losses L_WSM (8) and L_SE (9), which are defined directly from the true discrete score s_t and estimator s̃_t, with no fitted constants and no hidden use of the target IPM. The proof is carried out in Appendix A and B: the adjoint identity (42) follows from the Kolmogorov backward equation (18) plus integration by parts; Lemma B.12 bounds L_WRSM by 2L_SE + (4/3)L3 using the elementary inequality (49) under the stated factor-2 condition; Lemmas B.2, B.6, B.9, B.14 and Theorem C.4 supply the prefactors. None of these steps re-inserts the desired IPM inequality as an assumption. The only self-citations are to Mimikos-Stamatopoulos et al. (2024) for the continuous-space duality idea and to Birrell et al. (2022) for IPM background; both are antecedents, and the discrete adjoint contraction is re-proved in Lemma B.1, so the citation is not load-bearing. The factor-2 condition s̃ ∈ [s/2, 3s/2] in the preamble of Theorem 5.1 and Lemma B.12 is a genuine limitation: the cubic term L3 is not controlled by L_SE, so the advertised statement that training directly minimizes the IPM upper bound is not secured for masked rates if that condition fails. That is an accuracy/interpretation caveat, not a circular reduction of the bound to its own inputs. The masked and uniform cases specialize from Case 3 via the tensor-product factorization (12); this is an explicit structural assumption, not a renaming of the conclusion. No fitted-input-called-prediction, uniqueness-import, or ansatz-smuggling pattern is present.

Axiom & Free-Parameter Ledger

0 free parameters · 5 axioms · 0 invented entities

No free parameters fitted to data. All numerical choices (β, δ, T, C, CΨ) are algorithm or metric constants chosen by the practitioner, not inferred to make the bound hold. No new entities are introduced.

axioms (5)

domain assumption Rate matrices Q_t have nonnegative off-diagonal entries and zero row sums; Q_t integrable on [0,T] (Assumption 1).
Needed for well-defined CTMC transition kernels and for the generator row-sum zero trick in the adjoint proof (Section 3, Assumption 1).
domain assumption Forward transition kernel factorizes over tokens: Q_t(x,x^{i→x̂^i})=Q_t^{tok}(x_i,x̂_i) (eq. 12).
Used to compute p_t(m)=α_t^d and the d-neighbor/successor counts in the masked/uniform S-free cases.
domain assumption Score estimator \tilde s_t(x)_y is nonnegative and within [s/2, 3s/2] of true score on support of p_tQ_t for L_SE bounds.
Lemma B.5/B.12 requires the ratio bound; nonnegativity makes \tilde Q^← a valid generator.
domain assumption Data distribution p_data is supported on non-mask sequences (masked case), and each token is independently masked.
Gives p_t(m)=α_t^d and the initialization error 1-α_T^d ≤ d e^{-∥β∥_1} (Appendix E.2, Lemma B.14).
standard math Kolmogorov backward equation sup-norm contraction / Markov coupling theorem for CTMCs.
Lemma B.1 via Feynman-Kac or maximum principle; Theorem C.4 via Levin & Peres coupling characterization.

pith-pipeline@v1.3.0-alltime-deepseek · 45377 in / 19201 out tokens · 171296 ms · 2026-08-02T13:46:51.316857+00:00 · methodology

Original abstract

Discrete diffusion has become a leading framework for generative modeling in various applications including language, vision, and biology. Existing convergence theory, however, exhibits fundamental limitations. KL-based analyses diverge under singular priors such as the masked distribution, while bounds in total variation (TV) depend on the state space size $S$ and become vacuous for modern language tasks, where vocabularies contain hundreds of thousands of tokens. We develop a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM). To the best of our knowledge, our bounds are the first to be entirely free of $S$ and applicable to both masked and uniform priors. Importantly, our results can extend existing step complexity guarantees to any IPM. In addition, our theory relies only on a single standard rate-matrix regularity assumption and applies to general priors. Five novel techniques drive our improvements: 1. working in the space of observables via adjoint equations rather than directly with probability measures; 2. a regularity analysis that yields bounds on any IPM; 3. a coupling argument that removes $S$-dependence under uniform transitions; and 4. score-marginal cancellation and 5. exit-routing techniques that remove $S$-dependence under masked transitions. Our framework thus sharply departs from prior analyses and avoids the shortcomings of pathspace-KL and existing TV-based approaches. Beyond convergence bounds, our framework provides a versatile toolkit for further theoretical study of discrete diffusion models, including principled choices of loss functions and dimension-free step complexity.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
cs.LG 2026-05 unverdicted novelty 6.0

GADD achieves O(polylog(ε^{-1})) sampling complexity for uniform-rate discrete diffusion models via Gibbs correctors derived from the score function, with supporting experiments on text and music.

Reference graph

Works this paper leans on

51 extracted references · 4 linked inside Pith · cited by 1 Pith paper

[2]

On the numerical analysis of inhomogeneous continuous-time markov chains

Markus Arns, Peter Buchholz, and Andriy Panchenko. On the numerical analysis of inhomogeneous continuous-time markov chains. INFORMS Journal on Computing, 22(3):416--432, 2010

2010
[3]

Structured denoising diffusion models in discrete state-spaces

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981--17993, 2021

2021
[4]

(f,gamma)-divergences: Interpolating between f-divergences and integral probability metrics

Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis, Yannis Pantazis, and Luc Rey-Bellet. (f,gamma)-divergences: Interpolating between f-divergences and integral probability metrics. Journal of Machine Learning Research, 23(39):1--70, 2022. URL http://jmlr.org/papers/v23/21-0100.html

2022
[5]

A continuous time framework for discrete denoising models

Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266--28279, 2022

2022
[6]

Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design

Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=kQwSbv0BR4

2024
[7]

The numerical stability of leaping methods for stochastic simulation of chemically reacting systems

Yang Cao, Linda R Petzold, Muruhan Rathinam, and Daniel T Gillespie. The numerical stability of leaping methods for stochastic simulation of chemically reacting systems. Journal of Chemical Physics, 121(24):12169--12178, 2004

2004
[8]

Efficient step size selection for the tau-leaping simulation method

Yang Cao, Daniel T Gillespie, and Linda R Petzold. Efficient step size selection for the tau-leaping simulation method. The Journal of chemical physics, 124(4), 2006

2006
[9]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11315--11325, 2022

2022
[10]

Muse: Text-to-image generation via masked generative transformers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023

Pith/arXiv arXiv 2023
[11]

Convergence analysis of discrete diffusion model: Exact implementation through uniformization

Hongrui Chen and Lexing Ying. Convergence analysis of discrete diffusion model: Exact implementation through uniformization. Journal of Machine Learning, 4(2):108--127, June 2025. doi:10.4208/jml.240812. URL https://www.global-sci.com/index.php/jml/article/view/13211

doi:10.4208/jml.240812 2025
[12]

Optimal inference schedules for masked diffusion models

Sitan Chen, Kevin Cong, and Jerry Li. Optimal inference schedules for masked diffusion models. arXiv preprint arXiv:2511.04647, 2025

arXiv 2025
[13]

Fast sampling via discrete non-markov diffusion models with predetermined transition time

Zixiang Chen, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, and Quanquan Gu. Fast sampling via discrete non-markov diffusion models with predetermined transition time. Advances in Neural Information Processing Systems, 37:106870--106905, 2024

2024
[14]

Non-asymptotic convergence of discrete diffusion models: Masked and random walk dynamics

Giovanni Conforti, Alain Durmus, Le-Tuyet-Nhi Pham, and Gael Raoul. Non-asymptotic convergence of discrete diffusion models: Masked and random walk dynamics. arXiv preprint arXiv:2512.00580, 2025

arXiv 2025
[15]

Efficient sampling with discrete diffusion models: Sharp and adaptive guarantees

Daniil Dmitriev, Zhihan Huang, and Yuting Wei. Efficient sampling with discrete diffusion models: Sharp and adaptive guarantees. arXiv preprint arXiv:2602.15008, 2026

Pith/arXiv arXiv 2026
[16]

Approximate accelerated stochastic simulation of chemically reacting systems

Daniel T Gillespie. Approximate accelerated stochastic simulation of chemically reacting systems. The Journal of chemical physics, 115(4):1716--1733, 2001

2001
[17]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840--6851, 2020

2020
[18]

Simulation from endpoint-conditioned, continuous-time markov chains on a finite state space, with applications to molecular evolution

Asger Hobolth and Eric A Stone. Simulation from endpoint-conditioned, continuous-time markov chains on a finite state space, with applications to molecular evolution. The annals of applied statistics, 3(3):1204, 2009

2009
[19]

Argmax flows and multinomial diffusion: Learning categorical distributions

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in neural information processing systems, 34:12454--12465, 2021

2021
[20]

Reversibility and stochastic networks

Frank P Kelly. Reversibility and stochastic networks. J. Wiley, 1979

1979
[21]

Tabddpm: Modelling tabular data with diffusion models

Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. Tabddpm: Modelling tabular data with diffusion models. In International conference on machine learning, pp. 17564--17579. PMLR, 2023

2023
[22]

Discrete markov probabilistic models: An improved discrete score-based framework with sharp convergence bounds under minimal assumptions

PHAM Le-Tuyet-Nhi, Dario Shariatian, Antonio Ocello, Giovanni Conforti, and Alain Oliviero Durmus. Discrete markov probabilistic models: An improved discrete score-based framework with sharp convergence bounds under minimal assumptions. In Forty-second International Conference on Machine Learning, 2025

2025
[23]

Markov chains and mixing times, volume 107

David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017

2017
[24]

Breaking ar’s sampling bottleneck: Provable acceleration via diffusion language models

Gen Li and Changxiao Cai. Breaking ar’s sampling bottleneck: Provable acceleration via diffusion language models. Advances in Neural Information Processing Systems, 38:11700--11725, 2026

2026
[25]

Neural continuous-time markov chain: Discrete diffusion via decoupled jump timing and direction

Jingyuan Li, Xiaoyi Jiang, Fukang Wen, Wei Liu, Renqian Luo, Yi Zhu, Zuoqiang Shi, and Pipi Hu. Neural continuous-time markov chain: Discrete diffusion via decoupled jump timing and direction. arXiv preprint arXiv:2604.15694, 2026

Pith/arXiv arXiv 2026
[26]

Absorb and converge: Provable convergence guarantee for absorbing discrete diffusion models

Yuchen Liang, Renxiang Huang, Lifeng Lai, Ness Shroff, and Yingbin Liang. Absorb and converge: Provable convergence guarantee for absorbing discrete diffusion models. Advances in Neural Information Processing Systems, 38:20283--20318, 2025

2025
[27]

Discrete diffusion models: Novel analysis and new sampler guarantees

Yuchen Liang, Yingbin Liang, Lifeng Lai, and Ness Shroff. Discrete diffusion models: Novel analysis and new sampler guarantees. Advances in Neural Information Processing Systems, 38:165511--165548, 2026a

2026
[28]

Sharp convergence rates for masked diffusion models

Yuchen Liang, Zhiheng Tan, Ness Shroff, and Yingbin Liang. Sharp convergence rates for masked diffusion models, 2026b. URL https://arxiv.org/abs/2602.22505

arXiv 2026
[29]

Discrete diffusion modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=CNicRIVIPA

2024
[30]

Concrete score matching: Generalized score matching for discrete data

Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. Advances in Neural Information Processing Systems, 35:34532--34545, 2022

2022
[31]

Score-based generative models are provably robust: an uncertainty quantification perspective

Nikiforos Mimikos-Stamatopoulos, Benjamin J Zhang, and Markos A Katsoulakis. Score-based generative models are provably robust: an uncertainty quantification perspective. Advances in Neural Information Processing Systems, 37:63154--63183, 2024

2024
[32]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. Advances in Neural Information Processing Systems, 38:50608--50646, 2026

2026
[33]

Jump your steps: Optimizing sampling schedule of discrete diffusion models

Yong-Hyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, and Yuki Mitsufuji. Jump your steps: Optimizing sampling schedule of discrete diffusion models. In International Conference on Learning Representations, volume 2025, pp. 96272--96300, 2025

2025
[34]

Computational optimal transport: With applications to data science

Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Now Foundations and Trends, 2019

2019
[35]

How discrete and continuous diffusion meet: Comprehensive analysis of discrete diffusion models via a stochastic integral framework

Yinuo Ren, Haoxuan Chen, Grant Rotskoff, and Lexing Ying. How discrete and continuous diffusion meet: Comprehensive analysis of discrete diffusion models via a stochastic integral framework. In International Conference on Learning Representations, volume 2025, pp. 42904--42941, 2025

2025
[36]

Fast solvers for discrete diffusion models: Theory and applications of high-order algorithms

Yinuo Ren, Haoxuan Chen, Yuchen Zhu, Wei Guo, Yongxin Chen, Grant Rotskoff, Molei Tao, and Lexing Ying. Fast solvers for discrete diffusion models: Theory and applications of high-order algorithms. Advances in Neural Information Processing Systems, 38:167228--167282, 2026

2026
[37]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684--10695, 2022

[38] Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.

[39] Yair Schiff, Subham Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. In International Conference on Learning Representations, volume 2025, pp. 43776–43821, 2025.

[40] Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37:103131–103167, 2024.

[41] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

[42] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=St1giarCHLP

[43] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=PxTIG12RRHS

[44] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, phi-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.

[45] Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=BYWWwSY2G5s

[46] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. Digress: Discrete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=UaAD-Nu86WX

[47] Zikun Zhang, Zixiang Chen, and Quanquan Gu. Convergence of score-based discrete diffusion models: A discrete-time analysis. In International Conference on Learning Representations, volume 2025, pp. 34747–34772, 2025.

[48] Lingxiao Zhao, Xueying Ding, Lijun Yu, and Leman Akoglu. Unified discrete diffusion for categorical data. Journal of Machine Learning Research, 26(215):1–49, 2025.

[49] Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, and Scott Linderman. Informed correctors for discrete diffusion models. Advances in Neural Information Processing Systems, 38:125510–125538, 2026.

[50] Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. In International Conference on Learning Representations, volume 2025, pp. 63186–63227, 2025.

[51] Yuchen Zhu, Wei Guo, Jaemoo Choi, Guan-Horng Liu, Yongxin Chen, and Molei Tao. Mdns: Masked diffusion neural sampler via stochastic optimal control. Advances in Neural Information Processing Systems, 38:35260–35308, 2026.
