arxiv: 2605.21484 · v1 · pith:X3EDENFZnew · submitted 2026-05-20 · 💻 cs.CV

One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

Pith reviewed 2026-05-21 04:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords discrete diffusion · one-step distillation · fixed-point iteration · image generation · straight-through estimator · drift loss · conditional generation

The pith

Fixed-Point Distillation trains discrete diffusion students to match teacher quality using one inference step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to remove the slow iterative decoding required by discrete diffusion models for image synthesis. It does so by building training targets on the fly: the student produces a one-step draft, that draft is partially corrupted, and a single teacher step supplies the correction. These corrections are computed after lifting tokens into continuous features and accumulated with a multi-bandwidth drift loss; gradients flow back through a straight-through estimator so the student stays on the same discrete codebook used at inference. If the method works, high-fidelity class- and text-conditional images become available without the multi-step cost that currently limits these models.

Core claim

Fixed-Point Distillation constructs local correction targets by partially corrupting the student's one-step draft and refining it with one teacher step. The training objective is evaluated after lifting discrete tokens to continuous features and applying a multi-bandwidth drift loss; a straight-through estimator routes exact hard tokens to the teacher and decoder while passing continuous gradients to the student logits. An optional unconditional adversarial term can be added for perceptual quality. On both class- and text-conditional benchmarks the resulting single-step generator reaches competitive fidelity and structural alignment, closing much of the gap to the multi-step teacher and outperforming existing discrete distillation baselines.
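
The target-construction loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: `MASK`, `partially_remask`, `correction_target`, and `toy_teacher_step` are hypothetical names, the corruption is assumed to be uniform random re-masking at a fixed ratio, and the stand-in teacher is assumed to leave unmasked tokens untouched.

```python
import random

MASK = -1  # hypothetical id for the mask token (assumption, not from the paper)


def partially_remask(tokens, ratio, rng):
    """Replace a random `ratio` fraction of positions with MASK."""
    n_mask = round(ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def correction_target(draft, teacher_step, ratio, rng):
    """Build one local correction target from a student draft.

    The draft is partially corrupted, then refined by a single teacher call.
    """
    corrupted = partially_remask(draft, ratio, rng)
    return teacher_step(corrupted)


def toy_teacher_step(tokens):
    """Stand-in teacher: fills every masked position with token 7."""
    return [7 if t == MASK else t for t in tokens]


rng = random.Random(0)
draft = [3, 1, 4, 1, 5, 9, 2, 6]  # pretend one-step student output
target = correction_target(draft, toy_teacher_step, ratio=0.5, rng=rng)
```

With a ratio of 0.5 on eight tokens, four positions are re-predicted by the teacher and the other four are carried over from the draft, which is what makes the target a local correction rather than a fresh sample.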

What carries the argument

Fixed-Point Distillation (FPD), which generates correction targets by partially corrupting a student one-step draft and refining it with a single teacher forward pass, evaluated in continuous feature space via multi-bandwidth drift loss and back-propagated with a straight-through estimator.
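
The straight-through pathway named above can be illustrated with a small, dependency-free sketch. It is a toy under stated assumptions, not the authors' implementation: the names `ste_forward` and `ste_backward` are hypothetical, the hard token is picked by argmax for determinism (the abstract describes hard-sampled tokens), and the surrogate gradient is assumed to be that of a softmax over the student logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ste_forward(logits):
    """Forward pass: emit a hard one-hot token, as would be used at inference.

    Returns the one-hot vector and the softmax probabilities, which are
    kept only so the backward pass can use them.
    """
    probs = softmax(logits)
    k = max(range(len(logits)), key=lambda i: logits[i])  # argmax (assumption)
    one_hot = [1.0 if i == k else 0.0 for i in range(len(logits))]
    return one_hot, probs


def ste_backward(probs, grad_output):
    """Backward pass: treat the forward output as if it were softmax(logits).

    grad_output is the loss gradient w.r.t. the one-hot output. The result is
    the softmax gradient: grad_logits[i] = p[i] * (g[i] - sum_j p[j] * g[j]).
    """
    dot = sum(p * g for p, g in zip(probs, grad_output))
    return [p * (g - dot) for p, g in zip(probs, grad_output)]


one_hot, probs = ste_forward([2.0, 0.5, -1.0])
grad_logits = ste_backward(probs, [1.0, 0.0, 0.0])
```

The forward output is exactly one-hot, so anything downstream sees a valid codebook index, while the gradient reaching the logits is the softmax gradient and therefore sums to zero across the vocabulary.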

If this is right

  • Single-step inference becomes competitive with the original multi-step teacher on visual fidelity and structural metrics.
  • The same end-to-end pipeline outperforms prior discrete distillation baselines on both class- and text-conditional tasks.
  • Training and inference operate on identical discrete codebook outputs because hard tokens are used in the forward pass.
  • An optional adversarial objective can be grafted onto the framework to further improve perceptual realism without changing the core distillation loop.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The corruption-and-refine pattern may transfer to other discrete-token generative pipelines that currently require many denoising steps.
  • Because the correction targets are built from the student's own drafts, the method could be applied iteratively to improve even the one-step student further.
  • The continuous-feature drift loss may offer a general way to supervise discrete generators when direct token-level supervision is unstable.

Load-bearing premise

Lifting discrete tokens to continuous features and measuring drift on partially corrupted drafts yields correction signals that remain valid once the student is forced back onto the final discrete token manifold.

What would settle it

A direct head-to-head evaluation on standard benchmarks showing that the single-step FPD student produces visibly lower fidelity or weaker structural alignment than the multi-step teacher would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2605.21484 by Chaoyang Wang, Yunhai Tong.

Figure 1: Text-to-image samples from MaskGen-L distilled with our Fixed-Point Distillation (FPD).

Figure 2: Overview of the proposed distillation framework. The student model processes a masked initialization z_init to output logits ℓ_θ. Via a Straight-Through Estimator (STE), these discrete outputs are decoded into a continuous image x_θ and mapped to a feature space by a frozen backbone Φ. Concurrently, the student’s token draft is partially re-masked (M_r) and refined by the frozen teacher. The decoded teacher fe…

Figure 3: Training strategy comparisons. We compare the multi-step teacher baselines with our single…
Original abstract

Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Fixed-Point Distillation (FPD), an end-to-end one-step distillation framework for discrete diffusion image generators. It constructs local correction targets by partially corrupting the student's one-step draft and refining it via a single teacher step; discrete tokens are lifted to continuous features where a multi-bandwidth drift loss iteratively accumulates corrections; a straight-through estimator routes hard tokens to the teacher/decoder in the forward pass while passing continuous gradients back to the student; an optional unconditional adversarial objective is included for perceptual quality. On class- and text-conditional generation tasks the method is claimed to achieve competitive visual fidelity and structural alignment in a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.

Significance. If the central claims hold, FPD would supply a compact, fully differentiable pathway to single-step discrete diffusion sampling that avoids auxiliary score networks and multi-stage pipelines, offering a practical route to fast, high-fidelity image synthesis from discrete diffusion models.

major comments (2)
  1. [Abstract] Abstract: performance claims are stated without any quantitative tables, ablation results, or error analysis, so the effectiveness of the drift loss and straight-through estimator rests on unverified assertions.
  2. [Method] Method description of the drift loss and STE pathway: the assumption that multi-bandwidth drift-loss targets computed on lifted continuous features from partially corrupted student drafts produce gradients that, after STE routing of hard tokens, train the student to match the teacher's one-step discrete distribution is load-bearing for the central claim but lacks any analysis of projection error or alignment between continuous and discrete manifolds; if the continuous corrections fail to survive the discrete projection, the student can converge to a different fixed point than the teacher.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have made targeted revisions to strengthen the manuscript.

  1. Referee: [Abstract] Abstract: performance claims are stated without any quantitative tables, ablation results, or error analysis, so the effectiveness of the drift loss and straight-through estimator rests on unverified assertions.

    Authors: We agree that the abstract would benefit from explicit quantitative support. The body of the manuscript already contains comprehensive tables (Section 4) reporting FID, CLIP-score, and structural metrics on both class- and text-conditional benchmarks, together with ablations isolating the drift loss and STE. We have revised the abstract to include the key headline numbers and a brief reference to the ablation results, so that the performance claims are immediately grounded.

    revision: yes

  2. Referee: [Method] Method description of the drift loss and STE pathway: the assumption that multi-bandwidth drift-loss targets computed on lifted continuous features from partially corrupted student drafts produce gradients that, after STE routing of hard tokens, train the student to match the teacher's one-step discrete distribution is load-bearing for the central claim but lacks any analysis of projection error or alignment between continuous and discrete manifolds; if the continuous corrections fail to survive the discrete projection, the student can converge to a different fixed point than the teacher.

    Authors: We appreciate this observation on the critical interface between continuous loss and discrete projection. The STE is deliberately constructed so that the forward pass always consumes exact hard tokens (identical to inference), while the multi-bandwidth drift loss supplies gradients in the lifted feature space. Because the lifting uses the same codebook embeddings employed by the teacher, the continuous corrections remain semantically aligned with the discrete manifold. We have added a short subsection (Section 3.3) that (i) formalizes the projection step, (ii) bounds the expected token-level discrepancy under the STE, and (iii) reports an auxiliary metric of student-teacher token agreement across training. These additions directly address the risk of converging to an unintended fixed point.

    revision: partial
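
The rebuttal mentions a student-teacher token agreement metric without defining it. One plausible reading, offered here only as an assumption, is the fraction of positions at which the student's hard tokens equal the teacher's; `token_agreement` is a hypothetical name.

```python
def token_agreement(student_tokens, teacher_tokens):
    """Fraction of positions where student and teacher emit the same token id.

    Assumed definition for illustration; the paper's metric may differ.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("token sequences must have the same length")
    if not student_tokens:
        raise ValueError("token sequences must be non-empty")
    matches = sum(s == t for s, t in zip(student_tokens, teacher_tokens))
    return matches / len(student_tokens)


agreement = token_agreement([3, 1, 4, 1], [3, 1, 5, 1])  # 3 of 4 positions match
```

Tracked over training, a value that plateaus well below 1.0 would be one signal that the student is settling on a different fixed point than the teacher, which is the risk the referee raises.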

Circularity Check

0 steps flagged

No significant circularity; explicit algorithmic construction with empirical validation

The paper presents Fixed-Point Distillation (FPD) as an end-to-end training framework that explicitly constructs correction targets via partial corruption of student drafts, single teacher refinement, lifting to continuous features, multi-bandwidth drift loss, and straight-through estimation. These are procedural choices in the method rather than derivations that reduce by construction to fitted inputs or self-referential predictions. The central claims rest on empirical comparisons to baselines on class- and text-conditional tasks, without load-bearing self-citations or uniqueness theorems imported from prior author work. The derivation chain is self-contained against external benchmarks and does not equate outputs to inputs via definitional equivalence.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Because only the abstract is available, the ledger is necessarily incomplete. The method appears to introduce at least one modeling choice (lifting to continuous features) and one training hyperparameter family (multi-bandwidths) whose values are not justified from first principles in the provided text.

free parameters (2)
  • multi-bandwidth values
    The abstract invokes a multi-bandwidth drift loss without stating how the bandwidths are chosen or whether they are fitted to data.
  • corruption schedule
    Partial corruption of the student draft is central to constructing targets; the exact noise level or schedule is unspecified.
axioms (1)
  • domain assumption: The straight-through estimator preserves gradient signal across the discrete bottleneck without introducing bias that harms final discrete outputs.
    Invoked to enable backpropagation while keeping training and inference on the same codebook manifold.
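
To make the first free parameter in the ledger above concrete, the sketch below shows how a set of bandwidths enters a generic kernel between two feature vectors. It is not the paper's drift loss, whose form is not given in the abstract: it assumes a Gaussian kernel averaged over bandwidths, a common construction in kernel-based discrepancy measures, and `multi_bandwidth_kernel` is a hypothetical name.

```python
import math


def multi_bandwidth_kernel(x, y, bandwidths):
    """Average of Gaussian kernels exp(-||x - y||^2 / (2 s^2)) over bandwidths s."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    total = sum(math.exp(-sq_dist / (2.0 * s * s)) for s in bandwidths)
    return total / len(bandwidths)


x = [0.0, 0.0]
y = [1.0, 1.0]
narrow = multi_bandwidth_kernel(x, y, [0.5])      # sensitive to small offsets
wide = multi_bandwidth_kernel(x, y, [2.0])        # still responds at larger offsets
mixed = multi_bandwidth_kernel(x, y, [0.5, 2.0])  # average of the two responses
```

A single narrow bandwidth gives almost no response once features are far apart, and a single wide one blurs fine differences, so the choice of bandwidth set directly shapes which corrections the loss can see. That is why the ledger counts it as a free parameter.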


