Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

Cheng Zhang; Jiacheng Sun; Shuchen Xue; Tianyang Hu; Tianyu Xie; Zhenguo Li; Zijin Feng

arXiv: 2505.17384 · v2 · submitted 2025-05-23 · 💻 cs.LG · cs.CV · stat.ML

Tianyu Xie, Shuchen Xue, Zijin Feng, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Cheng Zhang

Pith reviewed 2026-05-19 14:16 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CV · stat.ML

keywords: discrete diffusion · masked diffusion models · variational autoencoding · latent variable modeling · dimensional correlations · denoising steps · sample quality · generative models

The pith

VADD adds latent variable modeling to masked diffusion to capture dimensional correlations and raise quality with few denoising steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Variational Autoencoding Discrete Diffusion (VADD) to address a key weakness of masked diffusion models (MDMs). Standard MDMs progressively unmask dimensions from a fully masked state, but because the dimensions unmasked in a single step are predicted independently, they miss inter-dimensional dependencies when the number of steps is small, which hurts sample quality. VADD defines each denoising distribution as a latent variable model and adds an auxiliary recognition model that approximates the latent posterior; both are trained by maximizing a variational lower bound. The recognition model provides amortized inference across the training set and lets the denoiser learn correlations implicitly through the latent variable. According to the paper, this retains the sampling efficiency of MDMs while improving sample quality on toy data, images, and text when few denoising steps are used.
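To make the weakness concrete, here is a minimal sketch on a two-dimensional toy distribution that is not taken from the paper. It compares one-step unmasking, where both dimensions are drawn independently from their marginals, with a mixture over a latent variable. The binary latent `z`, its prior, and the conditional tables are illustrative assumptions; the paper's latent is continuous.

```python
from itertools import product

# Toy data distribution over two binary dimensions: only 00 and 11 occur.
data = {(0, 0): 0.5, (1, 1): 0.5}

def marginals(dist):
    """Per-dimension probability that the dimension equals 1."""
    return [sum(p for x, p in dist.items() if x[d] == 1) for d in range(2)]

def factorized(m):
    """Joint distribution when each dimension is sampled independently."""
    return {
        x: (m[0] if x[0] else 1 - m[0]) * (m[1] if x[1] else 1 - m[1])
        for x in product((0, 1), repeat=2)
    }

def invalid_mass(dist):
    """Probability assigned to states that never occur in the data."""
    return dist[(0, 1)] + dist[(1, 0)]

# One-step unmasking without a latent: both dimensions come from marginals.
one_step = factorized(marginals(data))

# Hypothetical binary latent z: p(x | z) is still factorized per dimension,
# but averaging over z restores the correlation between dimensions.
prior_z = {0: 0.5, 1: 0.5}
cond_marginals = {0: [0.0, 0.0], 1: [1.0, 1.0]}
with_latent = {
    x: sum(prior_z[z] * factorized(cond_marginals[z])[x] for z in prior_z)
    for x in product((0, 1), repeat=2)
}

print(invalid_mass(one_step), invalid_mass(with_latent))  # prints "0.5 0.0"
```

The factorized sampler puts half of its probability on the states 01 and 10 that never appear in the data, while the latent mixture puts none there even though each conditional is still factorized.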

Core claim

We propose Variational Autoencoding Discrete Diffusion (VADD), a framework that augments discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bounds maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small.

What carries the argument

The auxiliary recognition model that performs amortized inference by learning inter-dimensional correlations through latent variables under the variational lower bound.
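For orientation, the bound this refers to can be sketched in its generic single-step form, using the symbols from the paper passage quoted further down this page (denoiser p_θ, prior p(z), recognition model r_ϕ). This is the standard Jensen-inequality bound and is an assumption about the shape of the objective; the paper's exact weighting over time steps and its KL annealing weight are not reproduced here.

```latex
\log p_\theta(x_s \mid x_t)
  = \log \int p_\theta(x_s \mid x_t, z)\, p(z)\, \mathrm{d}z
  \;\ge\;
  \mathbb{E}_{r_\phi(z \mid x_0, x_t)}\!\left[ \log p_\theta(x_s \mid x_t, z) \right]
  - \mathrm{KL}\!\left( r_\phi(z \mid x_0, x_t) \,\|\, p(z) \right)
```

The inequality holds for any choice of r_ϕ, so the recognition model can be trained jointly with the denoiser without invalidating the bound.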

If this is right

Sample quality improves on 2D toy data when the number of denoising steps is kept small.
Pixel-level image generation yields higher-quality outputs than MDM baselines under limited steps.
Text generation tasks show consistent gains in sample quality with the same efficiency.
The method works across multiple discrete data domains while preserving the original generation speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-variable trick could be tested on other discrete generative processes that suffer from short-horizon dependency problems.
If the gains hold, practitioners could reduce the number of denoising steps in production pipelines without losing output fidelity.
The approach opens a route for combining variational autoencoding with future variants of discrete diffusion that currently lack correlation modeling.

Load-bearing premise

The auxiliary recognition model can be trained stably via variational lower bounds and will implicitly capture the necessary inter-dimensional correlations without explicit modeling or additional regularization.

What would settle it

Training VADD on pixel-level image data and measuring no improvement in sample quality metrics such as FID when using only ten or fewer denoising steps compared with a matched MDM baseline would falsify the central performance claim.
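As a small worked example of how such a comparison would be scored, the sketch below computes the relative FID improvement at a fixed step count. The two numbers are the 10-step CIFAR-10 values quoted in the appendix excerpt further down this page, which compare VADD with the continuous DDPM baseline, not with the matched MDM baseline this test calls for. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline_fid: float, candidate_fid: float) -> float:
    """Fraction by which the candidate lowers FID relative to the baseline."""
    if baseline_fid <= 0:
        raise ValueError("baseline FID must be positive")
    return (baseline_fid - candidate_fid) / baseline_fid

# Values from the appendix excerpt: DDPM 298.7 vs. VADD 170.3 at 10 steps.
gain = relative_improvement(baseline_fid=298.7, candidate_fid=170.3)
print(f"{gain:.1%}")  # prints "43.0%"
```

A non-positive value from the same computation against a matched MDM baseline at ten or fewer steps would be the falsifying outcome described above.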

Figures

Figures reproduced from arXiv: 2505.17384 by Cheng Zhang, Jiacheng Sun, Shuchen Xue, Tianyang Hu, Tianyu Xie, Zhenguo Li, Zijin Feng.

**Figure 2.** The network architecture of the denoising model and recognition model in VADD for text

**Figure 3.** Non-cherry-picked samples generated by different discrete diffusion models and sampling

**Figure 4.** Generative perplexities (↓) evaluated by a pre-trained GPT-2 large model based on 256 samples on OpenWebText. All model sizes correspond to GPT-2 small

**Figure 5.** Histplots of the ground truth and the samples generated from different models and sampling

**Figure 6.** Histplots of the ground truth and the samples generated from VADD with various sampling

**Figure 7.** Samples generated by VADD with different sampling steps on the binarized MNIST.

**Figure 8.** Training and test negative DELBOs on the pixel-level image generation task.

**Figure 9.** CIFAR-10 samples generated from VADD with various sampling steps.

**Figure 10.** Text sample generated by VADD trained on OpenWebText with 128 sampling steps.

**Figure 11.** Text sample generated by VADD trained on OpenWebText with 256 sampling steps.

**Figure 12.** Text sample generated by VADD trained on OpenWebText with 512 sampling steps.

**Figure 13.** Text sample generated by VADD trained on OpenWebText with 1024 sampling steps.

Original abstract

Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bounds maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines in sample quality with few denoising steps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VADD adds a variational recognition model to masked diffusion to target inter-dimension correlations, with reported gains at low step counts, but the mechanism lacks direct evidence.

The core of this paper is a straightforward extension: take a masked diffusion model, add an auxiliary recognition network, and train the whole thing with a variational lower bound so the latent variables can pick up correlations that plain MDMs miss when unmasking happens in few steps. The claim is that this keeps the speed of MDMs while lifting sample quality on toy data, images, and text, especially in the low-step regime that matters for practical generation.
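The training recipe described here can be sketched in a few lines, assuming the diagonal Gaussian recognition model, reparameterization trick, and KL annealing weight mentioned in the paper passages quoted later on this page. Everything else is a placeholder: the function names are hypothetical, the recognition outputs and the reconstruction log-probability are made-up numbers, and a real implementation would compute them with neural networks.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one value per latent dim."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def annealed_objective(recon_log_prob, mu, log_var, kl_weight):
    """Single-sample loss: -log p(x | x_t, z) + kl_weight * KL.

    With kl_weight = 1 this is the negative variational lower bound;
    smaller weights relax the KL term early in training.
    """
    return -recon_log_prob + kl_weight * kl_to_standard_normal(mu, log_var)

rng = random.Random(0)
mu, log_var = [0.3, -0.1], [0.0, -1.0]      # hypothetical recognition outputs
z = reparameterize(mu, log_var, rng)         # latent fed to the denoiser
loss = annealed_objective(recon_log_prob=-2.5, mu=mu, log_var=log_var,
                          kl_weight=0.5)     # -2.5 is a stand-in value
```

Because the sample `z` is a deterministic function of `mu`, `log_var`, and the noise, gradients can flow through it to the recognition model in an autodiff framework.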

Referee Report

3 major / 2 minor

Summary. The paper proposes Variational Autoencoding Discrete Diffusion (VADD), which augments masked diffusion models (MDMs) with an auxiliary recognition model trained via variational lower-bound maximization. The framework is intended to implicitly capture inter-dimensional correlations in discrete data, thereby improving sample quality relative to standard MDMs especially when the number of denoising steps is small, while preserving generation efficiency. Empirical support is claimed on 2D toy data, pixel-level image generation, and text generation tasks.

Significance. If the reported gains are shown to arise specifically from correlation modeling rather than added capacity and if the recognition model training is demonstrated to be stable, the approach could offer a practical route to high-quality few-step discrete generation. This would address a recognized limitation of current MDMs without sacrificing their computational advantages.

major comments (3)

[Abstract] Abstract: the central claim that the auxiliary recognition model 'implicitly capture[s] correlations among dimensions' rests on the unverified assumption that variational lower-bound training alone will encode cross-dimension dependencies; no analysis, visualization of latent representations, or ablation isolating correlation capture versus capacity increase is referenced.
[Method] Method (variational training description): the manuscript provides no derivation or stability analysis for the amortized inference procedure; without explicit regularization, structured priors, or architectural bias toward joint modeling, the latent variables may collapse to per-dimension factors, undermining the few-step quality advantage.
[Experiments] Experiments: no error bars, statistical significance tests, or ablation controls (e.g., comparison against an MDM with matched parameter count but no latent variables) are mentioned; the reported outperformance on toy data, images, and text therefore cannot yet be attributed to the claimed correlation modeling.

minor comments (2)

[Abstract] Abstract: the phrase 'amortized inference over the training set' is used without a brief clarifying sentence on how it differs from standard variational inference in diffusion contexts.
[Method] Notation: the distinction between the diffusion process variables and the latent variables introduced by the recognition model should be made explicit in the first equation block where both appear.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where additional analysis and controls would strengthen the presentation of VADD's benefits for modeling inter-dimensional correlations in discrete diffusion. We address each major comment below and describe the revisions planned for the next version of the paper.

Referee: [Abstract] Abstract: the central claim that the auxiliary recognition model 'implicitly capture[s] correlations among dimensions' rests on the unverified assumption that variational lower-bound training alone will encode cross-dimension dependencies; no analysis, visualization of latent representations, or ablation isolating correlation capture versus capacity increase is referenced.

Authors: We agree that the current manuscript would benefit from explicit verification of the correlation-capture mechanism. In the revised version we will add visualizations of the learned latent representations (e.g., t-SNE projections or correlation matrices between latent dimensions and data dimensions) and an ablation that compares VADD against an MDM whose capacity has been increased to match the total parameter count of VADD but without the latent-variable component. These additions will help isolate the contribution of the variational modeling from mere capacity gains.

revision: yes

Referee: [Method] Method (variational training description): the manuscript provides no derivation or stability analysis for the amortized inference procedure; without explicit regularization, structured priors, or architectural bias toward joint modeling, the latent variables may collapse to per-dimension factors, undermining the few-step quality advantage.

Authors: We will include a concise derivation of the evidence lower bound (ELBO) for the joint training of the diffusion model and the recognition network in the revised Method section. While our empirical results on toy data, images, and text show consistent gains at low step counts without observable collapse, we acknowledge that a dedicated stability analysis is currently absent. In the revision we will add a short discussion of the architectural choices (shared encoder across dimensions and joint reconstruction objective) that encourage joint modeling, and we will report training curves to demonstrate stability. If needed, we can incorporate a simple KL-regularization term in future experiments.

revision: partial

Referee: [Experiments] Experiments: no error bars, statistical significance tests, or ablation controls (e.g., comparison against an MDM with matched parameter count but no latent variables) are mentioned; the reported outperformance on toy data, images, and text therefore cannot yet be attributed to the claimed correlation modeling.

Authors: We accept this criticism. The revised manuscript will report all quantitative results with error bars computed over at least five independent runs, include paired statistical significance tests (e.g., Wilcoxon signed-rank), and add the suggested capacity-matched ablation. These changes will allow readers to more confidently attribute performance differences to the latent-variable component rather than to increased model capacity.

revision: yes
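The proposed test has a practical limit that a short sketch makes visible. Below is a minimal pure-Python version of the exact two-sided Wilcoxon signed-rank test, assuming no zero and no tied differences. With exactly five paired runs the smallest attainable two-sided p-value is 2/32 = 0.0625, so even a clean sweep cannot reach the conventional 0.05 level; more runs or a one-sided test would be needed. The FID values are invented for illustration and are not results from the paper.

```python
from itertools import product

def wilcoxon_signed_rank_exact(x, y):
    """Return (W+, exact two-sided p-value) for paired samples x and y.

    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    abs_sorted = sorted(abs(d) for d in diffs)
    if abs_sorted[0] == 0 or len(set(abs_sorted)) != len(abs_sorted):
        raise ValueError("zero or tied differences are not handled here")
    rank = {v: i + 1 for i, v in enumerate(abs_sorted)}
    n = len(diffs)
    total = n * (n + 1) // 2
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null, every sign pattern over the ranks is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Hypothetical per-seed FID scores (lower is better), five paired runs.
vadd_fid = [20.1, 19.8, 20.5, 19.9, 20.3]
mdm_fid = [24.0, 23.1, 25.2, 23.9, 24.9]
w_plus, p_value = wilcoxon_signed_rank_exact(vadd_fid, mdm_fid)
print(w_plus, p_value)  # prints "0 0.0625"
```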

Circularity Check

0 steps flagged

No significant circularity detected

The paper's core derivation introduces an auxiliary recognition model trained by maximizing a variational lower bound to implicitly capture inter-dimensional correlations in masked diffusion models. This relies on standard external variational inference techniques (ELBO maximization) rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. The modeling choice is presented as an architectural augmentation with amortized inference, and performance claims are tied to empirical results on toy data, images, and text rather than reducing by construction to the inputs. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a circular manner from the abstract and framework description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that variational inference over an auxiliary recognition model will stably capture dimensional correlations without explicit terms or post-hoc adjustments.

axioms (1)

domain assumption: Variational lower bound maximization yields stable training for the joint diffusion and recognition model.
Invoked implicitly when stating that the auxiliary model enables stable training via variational lower bounds.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: we define it as a latent variable model: p_θ(x_s | x_t) = ∫ p_θ(x_s | x_t, z) p(z) dz ... auxiliary recognition model r_ϕ(z | x_0, x_t) ≈ p_θ(z | x_0, x_t) ... L̂_λ(x_0; θ, ϕ) ... KL annealing weight

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: the recognition model r_ϕ(z | x_0, x_t) is a diagonal Gaussian distribution ... reparameterization trick ... posterior collapse issue

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
cs.LG · 2026-05 · unverdicted · novelty 7.0
Learned Relay Representations enable masked diffusion models to propagate useful latent information across denoising steps, scaling to Fast-dLLM v2 to outperform supervised finetuning on coding tasks while cutting inf...

Infinite Mask Diffusion for Few-Step Distillation
cs.CL · 2026-05 · unverdicted · novelty 7.0
Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.

Unifying Masked Diffusion Models with Various Generation Orders and Beyond
cs.LG · 2026-02 · unverdicted · novelty 7.0
OeMDM unifies masked diffusion, autoregressive, and block diffusion models under various generation orders; LoMDM jointly optimizes ordering and diffusion backbone from scratch and outperforms prior discrete diffusion...

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 3 Pith papers · 4 internal anchors

[1] A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22563–22575.

[2] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.

[3] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

[4] Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, and Yuki Mitsufuji. Distillation of discrete diffusion through dimensional correlations. In Workshop on Machine Learning and Compression, NeurIPS 2024.

[5] Zhengfu He, Tianxiang Sun, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. arXiv preprint arXiv:2211.15029. (The extraction also captured a stray URL from an adjacent entry: https://openreview.net/forum?id=ibxO5X7kxc.)

[6] Michael I. Jordan, Zoubin Ghahramani, T. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233. (Source page header captured with this entry: "Published as a conference paper at ICLR 2026".)

[7] Deqian Kong, Minglu Zhao, Dehong Xu, Bo Pang, Shu Wang, Edouardo Honig, Zhangzhang Si, Chuan Li, Jianwen Xie, Sirui Xie, et al. Scalable language models with posterior inference of latent thought vectors. arXiv preprint arXiv:2502.01567.

[8] Anji Liu, Oliver Broadrick, Mathias Niepert, and Guy Van den Broeck. Discrete copula diffusion. arXiv preprint arXiv:2410.01949.

[9] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. ArXiv, abs/2204.06125.

[10] Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37:103131–103167.

[11] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

[12] Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, and Arash Vahdat. Energy-based diffusion language models for text generation. arXiv preprint arXiv:2410.21357.

[13] Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. Soft-di[m]o: Improving one-step discrete image generation with soft embeddings. ArXiv, abs/2509.22925, 2025a. Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. Di[M]o: Distilling masked diffusion models into one-step generator.... (The extraction also captured a stray URL from an adjacent entry: https://openreview.net/forum?id=CTC7CmirNr.)

[14] can be directly obtained in its original form without any transformation. The denoising model and the recognition model adopt the UNet architecture in the PyTorch implementation (https://github.com/addtt/variational-diffusion-models) of variational diffusion models (Kingma et al., 2021). • Binarized MNIST. For both the recognition model and the denois...

[15] We make the following modifications for the denoising model and the recognition model. • Denoising model. A special token [M] is also added to the table of the pixel embedding module. We transform the latent variable z into an embedding emb(z) with MLPs and add it to the embeddings of all pixels in each up b...

[16] B.3 Text generation. For the One Billion Word dataset, we firstly detokenize the texts following Lou et al. (2024). We then tokenize the texts with the bert-base-uncased tokenizer, following He et al. (2022); Lou et al. (2024). We pad and truncate the sequences to a length of

[17] For the OpenWebText dataset, we firstly detokenize the text following Lou et al. (2024). We then tokenize the texts with the GPT-2 tokenizer. We concatenate and wrap them to a sequence length of 1024, including a BOS and a EOS token as the first and last token of the sequence. We use the last 100K documents as the validation split. The network architectur...

[18] baseline, which is implemented by the diffusers.DDPMPipeline module with the google/ddpm-cifar10-32 pre-trained checkpoint. As shown in Table 7, VADD consistently outperforms the continuous diffusion baseline DDPM across all sampling steps on CIFAR-10. At 10 steps, VADD achieves an FID of 170.3 compared to 298.7 for DDPM, demonstrating a 43% improveme...

[19] We see that the ELBO estimate gradually improves as K increases. VADD (T =

[20] Figure 7: Samples generated by VADD with different sampling steps on the binarized MNIST. Table 7: FID score (↓) comparison between VADD and DDPM with different sampling steps T on the CIFAR-10 dataset. VADD consistently achieves lower FID scores across all sampling steps. FID is computed with 50K images using the clean-fid package. T 10 20 30 40 50 100 V...

[21] All model sizes correspond to GPT-2 small.
Model: Perplexity
AR (reproduced by Sahoo et al. (2024)): 17.54
SEDD (reproduced by Sahoo et al. (2024)): 24.10
MDLM (reported by Sahoo et al. (2024)): 23.21
MDLM (reproduced by us): 23.07
VADD (K=
