One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration
Pith reviewed 2026-05-21 04:34 UTC · model grok-4.3
The pith
Fixed-Point Distillation trains discrete diffusion students to match teacher quality using one inference step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fixed-Point Distillation constructs local correction targets by partially corrupting the student's one-step draft and refining it with one teacher step. The training objective is evaluated after lifting discrete tokens to continuous features and applying a multi-bandwidth drift loss; a straight-through estimator routes exact hard tokens to the teacher and decoder while passing continuous gradients to the student logits. An optional unconditional adversarial term can be added for perceptual quality. On both class- and text-conditional benchmarks the resulting single-step generator reaches competitive fidelity and structural alignment, closing much of the gap to the multi-step teacher and outp
What carries the argument
Fixed-Point Distillation (FPD), which generates correction targets by partially corrupting a student one-step draft and refining it with a single teacher forward pass, evaluated in continuous feature space via multi-bandwidth drift loss and back-propagated with a straight-through estimator.
If this is right
- Single-step inference becomes competitive with the original multi-step teacher on visual fidelity and structural metrics.
- The same end-to-end pipeline outperforms prior discrete distillation baselines on both class- and text-conditional tasks.
- Training and inference operate on identical discrete codebook outputs because hard tokens are used in the forward pass.
- An optional adversarial objective can be grafted onto the framework to further improve perceptual realism without changing the core distillation loop.
Where Pith is reading between the lines
- The corruption-and-refine pattern may transfer to other discrete-token generative pipelines that currently require many denoising steps.
- Because the correction targets are built from the student's own drafts, the method could be applied iteratively to improve even the one-step student further.
- The continuous-feature drift loss may offer a general way to supervise discrete generators when direct token-level supervision is unstable.
Load-bearing premise
Lifting discrete tokens to continuous features and measuring drift on partially corrupted drafts yields correction signals that remain valid once the student is forced back onto the final discrete token manifold.
What would settle it
A direct head-to-head evaluation on standard benchmarks showing that the single-step FPD student produces visibly lower fidelity or weaker structural alignment than the multi-step teacher would falsify the central performance claim.
Figures
read the original abstract
Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Fixed-Point Distillation (FPD), an end-to-end one-step distillation framework for discrete diffusion image generators. It constructs local correction targets by partially corrupting the student's one-step draft and refining it via a single teacher step; discrete tokens are lifted to continuous features where a multi-bandwidth drift loss iteratively accumulates corrections; a straight-through estimator routes hard tokens to the teacher/decoder in the forward pass while passing continuous gradients back to the student; an optional unconditional adversarial objective is included for perceptual quality. On class- and text-conditional generation tasks the method is claimed to achieve competitive visual fidelity and structural alignment in a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.
Significance. If the central claims hold, FPD would supply a compact, fully differentiable pathway to single-step discrete diffusion sampling that avoids auxiliary score networks and multi-stage pipelines, offering a practical route to fast, high-fidelity image synthesis from discrete diffusion models.
major comments (2)
- [Abstract] Abstract: performance claims are stated without any quantitative tables, ablation results, or error analysis, so the effectiveness of the drift loss and straight-through estimator rests on unverified assertions.
- [Method] Method description of the drift loss and STE pathway: the assumption that multi-bandwidth drift-loss targets computed on lifted continuous features from partially corrupted student drafts produce gradients that, after STE routing of hard tokens, train the student to match the teacher's one-step discrete distribution is load-bearing for the central claim but lacks any analysis of projection error or alignment between continuous and discrete manifolds; if the continuous corrections fail to survive the discrete projection, the student can converge to a different fixed point than the teacher.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have made targeted revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: performance claims are stated without any quantitative tables, ablation results, or error analysis, so the effectiveness of the drift loss and straight-through estimator rests on unverified assertions.
Authors: We agree that the abstract would benefit from explicit quantitative support. The body of the manuscript already contains comprehensive tables (Section 4) reporting FID, CLIP-score, and structural metrics on both class- and text-conditional benchmarks, together with ablations isolating the drift loss and STE. We have revised the abstract to include the key headline numbers and a brief reference to the ablation results, so that the performance claims are immediately grounded. revision: yes
-
Referee: [Method] Method description of the drift loss and STE pathway: the assumption that multi-bandwidth drift-loss targets computed on lifted continuous features from partially corrupted student drafts produce gradients that, after STE routing of hard tokens, train the student to match the teacher's one-step discrete distribution is load-bearing for the central claim but lacks any analysis of projection error or alignment between continuous and discrete manifolds; if the continuous corrections fail to survive the discrete projection, the student can converge to a different fixed point than the teacher.
Authors: We appreciate this observation on the critical interface between continuous loss and discrete projection. The STE is deliberately constructed so that the forward pass always consumes exact hard tokens (identical to inference), while the multi-bandwidth drift loss supplies gradients in the lifted feature space. Because the lifting uses the same codebook embeddings employed by the teacher, the continuous corrections remain semantically aligned with the discrete manifold. We have added a short subsection (Section 3.3) that (i) formalizes the projection step, (ii) bounds the expected token-level discrepancy under the STE, and (iii) reports an auxiliary metric of student-teacher token agreement across training. These additions directly address the risk of converging to an unintended fixed point. revision: partial
Circularity Check
No significant circularity; explicit algorithmic construction with empirical validation
full rationale
The paper presents Fixed-Point Distillation (FPD) as an end-to-end training framework that explicitly constructs correction targets via partial corruption of student drafts, single teacher refinement, lifting to continuous features, multi-bandwidth drift loss, and straight-through estimation. These are procedural choices in the method rather than derivations that reduce by construction to fitted inputs or self-referential predictions. The central claims rest on empirical comparisons to baselines on class- and text-conditional tasks, without load-bearing self-citations or uniqueness theorems imported from prior author work. The derivation chain is self-contained against external benchmarks and does not equate outputs to inputs via definitional equivalence.
Axiom & Free-Parameter Ledger
free parameters (2)
- multi-bandwidth values
- corruption schedule
axioms (1)
- domain assumption Straight-through estimator preserves gradient signal across the discrete bottleneck without introducing bias that harms final discrete outputs.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
constructs local correction targets by partially corrupting the student’s one-step draft and refining it with a single teacher step... lift discrete tokens into continuous features and apply a multi-bandwidth drift loss
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
straight-through estimator that feeds exact hard-sampled tokens... ensuring that training and inference operate on the same codebook manifold
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Cosmos World Foundation Model Platform for Physical AI
Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021
work page 2021
-
[3]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[4]
A pytorch reproduction of masked generative image transformer.arXiv preprint arXiv:2310.14400, 2023
Victor Besnier and Mickael Chen. A pytorch reproduction of masked generative image transformer.arXiv preprint arXiv:2310.14400, 2023
-
[5]
Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P Breckon, and Chris G Willcocks. Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. InEuropean Conference on Computer Vision, pages 170–188. Springer, 2022
work page 2022
-
[6]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023
work page 2023
-
[7]
Genie: Generative interactive environments
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024
work page 2024
-
[8]
Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models.Advances in Neural Information Processing Systems, 35:28266–28279, 2022
work page 2022
-
[9]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022
work page 2022
-
[10]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009
work page 2009
-
[11]
Generative Modeling via Drifting
Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting.arXiv preprint arXiv:2602.04770, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast llms via self-distillation through time.arXiv preprint arXiv:2410.21035, 2024
-
[13]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
work page 2024
-
[14]
Mean Flows for One-step Generative Modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023
work page 2023
-
[16]
Vector quantized diffusion model for text-to-image synthesis
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10696–10706, 2022
work page 2022
-
[17]
Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, and Yuki Mitsufuji. Distillation of discrete diffusion through dimensional correlations.arXiv preprint arXiv:2410.08709, 2024
-
[18]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017
work page 2017
-
[19]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
work page 2020
-
[20]
Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022
work page 2022
-
[21]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[22]
Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, and Tim Salimans. Beyond single tokens: Distilling discrete diffusion models via discrete mmd.arXiv preprint arXiv:2603.20155, 2026
-
[23]
Unified discrete diffusion for simultaneous vision-language generation
Minghui Hu, Chuanxia Zheng, Heliang Zheng, Tat-Jen Cham, Chaoyue Wang, Zuopeng Yang, Dacheng Tao, and Ponnuthurai N Suganthan. Unified discrete diffusion for simultaneous vision-language generation. arXiv preprint arXiv:2211.14842, 2022. 10
-
[24]
Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens
Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. InICCV, 2025
work page 2025
-
[25]
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32, 2019
work page 2019
-
[26]
Reward guided latent consistency distillation.arXiv preprint arXiv:2403.11027, 2024
Jiachen Li, Weixi Feng, Wenhu Chen, and William Yang Wang. Reward guided latent consistency distillation.arXiv preprint arXiv:2403.11027, 2024
-
[27]
Instaflow: One step is enough for high- quality diffusion-based text-to-image generation
Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high- quality diffusion-based text-to-image generation. InThe Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[28]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models.Advances in Neural Information Processing Systems, 36:76525–76546, 2023
work page 2023
-
[32]
Learning few-step diffusion models by trajectory distribution matching
Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. Learning few-step diffusion models by trajectory distribution matching. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17719–17728, 2025
work page 2025
-
[33]
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations.arXiv preprint arXiv:2108.01073, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[34]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[35]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
DreamFusion: Text-to-3D using 2D Diffusion
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[38]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInternational conference on machine learning, pages 8821–8831. Pmlr, 2021
work page 2021
-
[39]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
work page 2022
-
[40]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022
work page 2022
-
[41]
Simple and effective masked diffusion language models
Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024
work page 2024
-
[42]
Improved techniques for training gans.Advances in neural information processing systems, 29, 2016
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016
work page 2016
-
[43]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
Fast high-resolution image synthesis with latent adversarial diffusion distillation
Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024
work page 2024
-
[45]
Adversarial diffusion distillation
Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer, 2024
work page 2024
-
[46]
Stylegan-xl: Scaling stylegan to large diverse datasets
Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022
work page 2022
-
[47]
Christoph Schuhmann and Romain Beaumont. Laion-aesthetics.LAION. AI, 4, 2022. 11
work page 2022
-
[48]
Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data.Advances in neural information processing systems, 37:103131–103167, 2024
work page 2024
-
[49]
Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[51]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. InInternational Conference on Machine Learning, pages 32211–32252, 2023
work page 2023
-
[52]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[53]
Score-based continuous-time discrete diffusion models
Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models.arXiv preprint arXiv:2211.16750, 2022
-
[54]
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers.arXiv preprint arXiv:2104.10157, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[55]
Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024
work page 2024
-
[56]
One-step diffusion with distribution matching distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024
work page 2024
-
[57]
Redi: Rectified discrete flow.arXiv preprint arXiv:2507.15897, 2025
Jaehoon Yoo, Wonjung Kim, and Seunghoon Hong. Redi: Rectified discrete flow.arXiv preprint arXiv:2507.15897, 2025
-
[58]
Guided score identity distillation for data-free one-step text-to-image generation
Mingyuan Zhou, Zhendong Wang, Huangjie Zheng, and Hai Huang. Long and short guidance in score identity distillation for one-step text-to-image generation.arXiv preprint arXiv:2406.01561, 3, 2024
-
[59]
Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[60]
Di [m] o: Distilling masked diffusion models into one-step generator
Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. Di [m] o: Distilling masked diffusion models into one-step generator. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18606–18618, 2025
work page 2025
-
[61]
Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. Soft-di [m] o: Improving one-step discrete image generation with soft embeddings.arXiv preprint arXiv:2509.22925, 2025. 12
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.