Progressive Distillation for Fast Sampling of Diffusion Models
Pith reviewed 2026-05-11 09:31 UTC · model grok-4.3
The pith
Progressive distillation reduces diffusion model sampling from thousands of steps to 4 while keeping high image quality on standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from a deterministic diffusion sampler that uses up to 8192 steps, the authors repeatedly apply a distillation procedure in which each new model is trained to reproduce the previous model's deterministic sampler using half as many steps. Combined with parameterizations that improve stability at low step counts, this yields models that generate samples in as few as 4 steps on CIFAR-10, ImageNet, and LSUN without losing much perceptual quality.
What carries the argument
The progressive distillation procedure, which trains a student diffusion model to match a teacher sampler's multi-step trajectory using half the steps, combined with re-parameterizations that stabilize few-step sampling.
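The halving step can be made concrete with a toy, scalar sketch. It assumes a cosine noise schedule, a clean-data (x) prediction, and the deterministic DDIM update. The names `alpha_sigma`, `ddim_step`, `toy_teacher`, and `distillation_target` are hypothetical and chosen for illustration, and the teacher is a closed-form stand-in (exact only for N(0, 1) data) rather than a trained network. The target is the clean-data prediction that makes one student step land where two teacher steps land.

```python
import math


def alpha_sigma(t):
    """Cosine schedule (an assumption): alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim_step(z_t, x_hat, t, s):
    """Deterministic DDIM update from time t to s < t, given a clean-data prediction."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    return a_s * x_hat + (s_s / s_t) * (z_t - a_t * x_hat)


def toy_teacher(z, t):
    """Hypothetical teacher: the exact E[x | z_t] when the data are N(0, 1)."""
    a_t, _ = alpha_sigma(t)
    return a_t * z


def distillation_target(z_t, t, n_student_steps, teacher=toy_teacher):
    """Clean-data target that makes ONE student step match TWO teacher steps."""
    t_mid = t - 0.5 / n_student_steps
    t_end = t - 1.0 / n_student_steps
    # Two teacher DDIM steps: t -> t_mid -> t_end.
    z_mid = ddim_step(z_t, teacher(z_t, t), t, t_mid)
    z_end = ddim_step(z_mid, teacher(z_mid, t_mid), t_mid, t_end)
    # Invert a single DDIM step t -> t_end for the x prediction that reaches z_end.
    a_t, s_t = alpha_sigma(t)
    a_e, s_e = alpha_sigma(t_end)
    ratio = s_e / s_t
    x_target = (z_end - ratio * z_t) / (a_e - ratio * a_t)
    return x_target, z_end, t_end


z_t, t, n = 0.7, 1.0, 4  # start from pure noise with a 4-step student
x_target, z_end, t_end = distillation_target(z_t, t, n)
student_z = ddim_step(z_t, x_target, t, t_end)
print(abs(student_z - z_end) < 1e-9)  # one student step reproduces two teacher steps
```

In a real training loop the student network would be regressed onto `x_target`; the sketch only checks that the target is algebraically consistent with the two-step teacher update.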
Load-bearing premise
That successive rounds of distillation do not accumulate enough error to degrade image quality and that the new parameterizations keep sampling stable when the step count is reduced across different image datasets.
What would settle it
A direct comparison on CIFAR-10 or ImageNet in which the 4-step distilled model produces visibly worse samples or a substantially higher FID than the original 8192-step sampler, or in which further distillation rounds cause a sudden quality collapse.
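For orientation on what an FID comparison measures, the sketch below computes the Fréchet distance between two one-dimensional Gaussians fitted to samples. This is a simplification under stated assumptions: the actual FID fits multivariate Gaussians to Inception-network features of images, whereas here the "features" are plain scalars and `frechet_distance_1d` is a hypothetical helper name.

```python
import statistics


def frechet_distance_1d(samples_a, samples_b):
    """Squared Frechet distance between 1-D Gaussians fitted to two sample sets.

    For univariate Gaussians this reduces to
    (mu_a - mu_b)^2 + (sd_a - sd_b)^2.
    """
    mu_a, mu_b = statistics.fmean(samples_a), statistics.fmean(samples_b)
    sd_a, sd_b = statistics.pstdev(samples_a), statistics.pstdev(samples_b)
    return (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2


same = frechet_distance_1d([0.0, 2.0], [0.0, 2.0])     # identical statistics
shifted = frechet_distance_1d([0.0, 2.0], [1.0, 3.0])  # mean shifted by 1
print(same, shifted)
```

A distance of zero means the fitted statistics agree; a "substantially higher FID" for the 4-step model would correspond to a larger gap between its feature statistics and those of real images.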
Original abstract
Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
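The arithmetic behind "8192 steps down to 4" can be sketched in a few lines. The function name `halving_schedule` is hypothetical. The 50k-updates-per-round figure is the one quoted in the experimental-settings excerpt on this page, and applying it uniformly to every round from 8192 down to 4 is an assumption made here for illustration, not a statement about how each benchmark was actually run.

```python
def halving_schedule(start_steps, end_steps):
    """Return the sampler step counts visited when repeatedly halving."""
    if end_steps < 1 or start_steps < end_steps:
        raise ValueError("need start_steps >= end_steps >= 1")
    schedule = [start_steps]
    while schedule[-1] > end_steps:
        if schedule[-1] % 2:
            raise ValueError("cannot halve an odd step count")
        schedule.append(schedule[-1] // 2)
    if schedule[-1] != end_steps:
        raise ValueError("end_steps is not reachable by halving")
    return schedule


schedule = halving_schedule(8192, 4)
rounds = len(schedule) - 1        # number of distillation rounds
assumed_updates = rounds * 50_000  # assumption: a flat 50k updates per round

print(schedule)          # [8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4]
print(rounds)            # 11
print(assumed_updates)   # 550000
```

Under that assumption, 11 rounds add up to 550k updates, which is below the 800k updates quoted for training the original CIFAR-10 model. Update counts are only a rough proxy for wall-clock time, since a distillation update also evaluates the teacher.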
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that new parameterizations of diffusion models increase stability for few-step sampling, and that a progressive distillation procedure can iteratively halve the number of sampling steps (from up to 8192 down to 4) while preserving perceptual quality on image generation tasks. It reports concrete results such as an FID of 3.0 on CIFAR-10 with 4 steps, along with results on ImageNet and LSUN, and states that the full distillation procedure takes no more time than training the original model.
Significance. If the empirical results hold, the work is significant for addressing the slow sampling drawback of diffusion models, enabling fast generation competitive with alternatives like GANs while retaining quality and density estimation advantages. The progressive distillation approach combined with the new parameterizations provides a practical, efficient solution, and the manuscript supplies falsifiable benchmark outcomes across multiple standard datasets.
major comments (2)
- [§5] §5 (Experimental results): The central claim that progressive distillation preserves perceptual quality down to 4 steps (e.g., CIFAR-10 FID of 3.0) is load-bearing, yet the reported benchmark numbers lack error bars, multiple random seed statistics, or ablations isolating the new parameterizations from the distillation procedure; this directly affects assessment of robustness against error accumulation.
- [§3.2] §3.2 (New parameterizations): The claim that the introduced parameterizations reliably stabilize few-step sampling is central to enabling the progressive procedure, but the section provides no analysis or equations demonstrating their effect on sampling dynamics or variance reduction, relying only on end-to-end empirical outcomes.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the exact sequence of distillation steps applied and the base model architectures used for each benchmark.
- [§4] Notation for the teacher-student alignment in the distillation loss could be clarified with an additional equation showing how the student is trained to match the teacher's multi-step trajectory.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work's significance and for the constructive feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where we will strengthen the presentation of results and analysis.
point-by-point responses
-
Referee: [§5] §5 (Experimental results): The central claim that progressive distillation preserves perceptual quality down to 4 steps (e.g., CIFAR-10 FID of 3.0) is load-bearing, yet the reported benchmark numbers lack error bars, multiple random seed statistics, or ablations isolating the new parameterizations from the distillation procedure; this directly affects assessment of robustness against error accumulation.
Authors: We acknowledge that error bars, multi-seed statistics, and explicit ablations would strengthen the assessment of robustness. The manuscript reports results from single runs with fixed seeds for reproducibility, but demonstrates consistency by applying the same progressive procedure across CIFAR-10, ImageNet, and LSUN while preserving quality from 8192 steps down to 4. The load-bearing claim is further supported by the fact that each halving step maintains perceptual quality without retraining from scratch. To address the concern directly, we will revise §5 to include error bars from additional runs (where feasible given compute), a note on seed consistency, and a targeted ablation isolating the new parameterizations' contribution from the distillation steps. revision: yes
-
Referee: [§3.2] §3.2 (New parameterizations): The claim that the introduced parameterizations reliably stabilize few-step sampling is central to enabling the progressive procedure, but the section provides no analysis or equations demonstrating their effect on sampling dynamics or variance reduction, relying only on end-to-end empirical outcomes.
Authors: Section 3.2 introduces the new parameterizations (including the velocity parameterization) as direct modifications to the standard diffusion model output that reduce sensitivity to accumulated errors in few-step regimes. The section provides the explicit functional forms and motivates them via their effect on the reverse-process update. While the primary validation is through the end-to-end progressive distillation results, we agree that additional equations would clarify the variance-reduction mechanism. We will revise §3.2 to include the sampling update equations under these parameterizations and a short derivation showing how they lower the effective variance of the predicted clean image relative to noise prediction, thereby enabling stable halving. revision: yes
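To illustrate why a velocity-style output can behave better than noise prediction when few steps are used, the sketch below assumes the common definition v = alpha * eps - sigma * x with alpha^2 + sigma^2 = 1 and scalar data. The helper names are hypothetical. It compares how a fixed error in the network output propagates to the implied clean-data estimate near the pure-noise end of the process, where alpha is small.

```python
import math


def v_from_x_eps(x, eps, alpha, sigma):
    """Velocity target (assumed definition): v = alpha * eps - sigma * x."""
    return alpha * eps - sigma * x


def x_from_v(z, v, alpha, sigma):
    """Clean-data estimate implied by a v prediction: x = alpha * z - sigma * v."""
    return alpha * z - sigma * v


def x_from_eps(z, eps, alpha, sigma):
    """Clean-data estimate implied by a noise prediction: x = (z - sigma * eps) / alpha."""
    return (z - sigma * eps) / alpha


# A high-noise time step: alpha is close to 0, sigma close to 1.
alpha = 0.01
sigma = math.sqrt(1.0 - alpha ** 2)
x, eps = 0.3, -1.2
z = alpha * x + sigma * eps
v = v_from_x_eps(x, eps, alpha, sigma)

delta = 1e-3  # identical output error injected into each parameterization
err_eps = abs(x_from_eps(z, eps + delta, alpha, sigma) - x)  # scales as sigma * delta / alpha
err_v = abs(x_from_v(z, v + delta, alpha, sigma) - x)        # scales as sigma * delta

print(err_eps / err_v)  # roughly 1 / alpha
```

With noise prediction the error in the clean-data estimate is amplified by about 1/alpha, which grows without bound as alpha approaches zero, while with the v output it stays bounded by the size of the output error. This is a sensitivity argument only and does not replace the derivation the authors promise for §3.2.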
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical training procedure (progressive distillation) and new parameterizations for diffusion models, with all load-bearing claims consisting of experimental outcomes measured on held-out benchmarks such as CIFAR-10 FID scores. No equations, predictions, or first-principles derivations reduce outputs to inputs by construction, and no self-citations serve as the sole justification for the central method or results. The procedure is self-contained against external validation.
Axiom & Free-Parameter Ledger
free parameters (1)
- distillation hyperparameters
axioms (1)
- domain assumption Diffusion models admit parameterizations that remain stable under few-step sampling
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
unclear · Relation between the paper passage and the cited Recognition theorem.
we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps
-
IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
unclear · Relation between the paper passage and the cited Recognition theorem.
we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
-
Query Lower Bounds for Diffusion Sampling
Diffusion sampling from d-dimensional distributions requires at least ~sqrt(d) adaptive score queries when score estimates have polynomial accuracy.
-
DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation
DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adapt...
-
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
-
Generative Pseudo-Force Fields for Molecular Generation
Proposes generative pseudo-force fields trained on quadratic pseudo-potentials from noisy equilibria as a time-step-agnostic diffusion variant for efficient molecular conformation generation with high validity on QM9.
-
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
-
StreamingEffect: Real-Time Human-Centric Video Effect Generation
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
-
Training-Free Generative Sampling via Moment-Matched Score Smoothing
MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.
-
Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation
A hypernetwork maps style motion embeddings to LoRA updates that stylize text-driven motion diffusion models with improved generalization to unseen styles via contrastive structuring of the style space.
-
One-Step Generative Modeling via Wasserstein Gradient Flows
W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...
-
Muninn: Your Trajectory Diffusion Model But Faster
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
-
HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
-
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
-
PODiff: Latent Diffusion in Proper Orthogonal Decomposition Space for Scientific Super-Resolution
PODiff performs conditional diffusion in a fixed, variance-ordered POD latent space to enable efficient probabilistic super-resolution of high-dimensional scientific fields with lower memory and better-calibrated unce...
-
Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...
-
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
-
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
-
Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement
A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.
-
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
-
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
-
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
-
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.
-
Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
-
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
Stream-DiffVSR enables practical low-latency video super-resolution by combining a four-step distilled denoiser, auto-regressive temporal guidance, and a temporal processor in a strictly causal pipeline.
-
Large Video Planner Enables Generalizable Robot Control
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
-
Training Agents Inside of Scalable World Models
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
-
StereoFoley: Object-Aware Stereo Audio Generation from Video
StereoFoley is an end-to-end video-to-stereo-audio framework that uses a base generative model fine-tuned on synthetic object-tracked data with panning and distance controls to achieve object-aware spatial sound.
-
Lipschitz-Guided Design of Interpolation Schedules in Generative Models
Minimizing averaged squared Lipschitzness of the drift produces interpolation schedules that improve numerical accuracy and mitigate mode collapse in generative models, with closed-form optima for Gaussians and valida...
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
Training-Free Inference for High-Resolution Sinogram Completion
HRSino is a training-free adaptive diffusion inference approach for high-resolution sinogram completion that reduces peak memory by up to 30.81% and inference time by up to 17.58% while maintaining accuracy.
-
History-Guided Video Diffusion
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
-
One Step Diffusion via Shortcut Models
Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
-
Diffusion Models Are Real-Time Game Engines
A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
-
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
-
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
-
RiT: Vanilla Diffusion Transformers Suffice in Representation Space
A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
-
Variance Reduction for Expectations with Diffusion Teachers
CARV amortizes upstream diffusion teacher costs over noise resamples with timestep importance sampling and stratified-inverse-CDF sampling, delivering 2-3x effective compute gains in text-to-3D experiments and order-o...
-
Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment
REPA-P aligns intermediate representations in diffusion models with physical states using first-principles PDE residuals to accelerate convergence and boost out-of-distribution robustness on PDE tasks.
-
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
LIFT and PLACE enable stable knowledge distillation for extremely lightweight diffusion models by decomposing the task into coarse alignment followed by fine refinement with piecewise local adaptive guidance.
-
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
LIFT and PLACE enable stable training of extremely compressed diffusion models by breaking distillation into coarse linear alignment followed by local adaptive refinement.
-
WavFlow: Audio Generation in Waveform Space
WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
-
DCFold: Efficient Protein Structure Generation with Single Forward Pass
DCFold achieves AlphaFold3-level protein structure prediction accuracy in a single forward pass using Dual Consistency training and a Temporal Geodesic Matching scheduler, delivering 15x inference acceleration.
-
Taming Audio VAEs via Target-KL Regularization
The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
-
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers
DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
-
ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency p...
-
FLASH: Efficient Visuomotor Policy via Sparse Sampling
FLASH Policy uses sparse Legendre polynomial trajectory fitting and history-anchored flow matching to enable single-step inference for visuomotor control, reporting 31.4 ms per-episode latency and >=92% success on fiv...
-
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
CLVR couples verified logical planning with pixel diffusion, uses proxy reinforcement learning on distilled histories, and merges weights to cut inference to 4 NFEs while outperforming open-source T2I models on comple...
-
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
-
ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.
-
Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies
A multivariate diffusion generative downscaling method preserves inter-variable correlations in climate data under large resolution increases, enabling more accurate compound risk assessment.
-
FlashMol: High-Quality Molecule Generation in as Few as Four Steps
FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...
-
MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.
-
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
-
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
-
WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis
WFM achieves near-diffusion quality for all four BraTS MRI modalities with one 82M model in 1-2 steps by flowing from the mean of conditioning modalities in wavelet space, running 250-1000x faster.
Reference graph
Works this paper leans on
-
[1]
Structured denoising diffusion models in discrete state-spaces
Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. CoRR, abs/2107.03006,
-
[2]
Learning gradient fields for shape generation
Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. arXiv preprint arXiv:2008.06520,
-
[3]
Diffusion Models Beat GANs on Image Synthesis
Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233,
-
[4]
FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models
Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367,
-
[5]
Cascaded diffusion models for high fidelity image generation
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282,
-
[6]
Argmax flows and multinomial diffusion: Learning categorical distributions, 2021
Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. arXiv preprint arXiv:2102.05379,
-
[7]
Gotta go fast when generating data with score-based models
Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080,
-
[8]
Variational diffusion models
Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630,
-
[9]
On fast sampling of diffusion probabilistic models
10 Published as a conference paper at ICLR 2022 Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2106.00132,
-
[10]
Bilateral denoising diffusion models
Max WY Lam, Jun Wang, Rongjie Huang, Dan Su, and Dong Yu. Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514,
- [11]
-
[12]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,
-
[13]
Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed
Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388,
-
[14]
Non gaussian denoising diffusion models
Eliya Nachmani, Robin San Roman, and Lior Wolf. Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582,
-
[15]
Fast generation for convolutional autoregressive models
Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A Hasegawa-Johnson, Roy H Campbell, and Thomas S Huang. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001,
-
[16]
Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636,
-
[17]
Noise estimation for generative diffusion models
Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600,
-
[18]
Maximum likelihood training of score-based diffusion models
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. arXiv e-prints, pp. arXiv–2101, 2021b. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference ...
-
[19]
Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019a. Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference ...
-
[20]
Learning to efficiently sample from diffusion probabilistic models
Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802,
-
[21]
A. Probability flow ODE in terms of log-SNR. Song et al. (2021c) formulate the forward diffusion process in terms of an SDE of the form $dz = f(z,t)\,dt + g(t)\,dW$ (10), and show that samples from this diffusion process can be generated by solving the associated probability flow ODE: $dz = [f(z,t) - \frac{1}{2} g^2(t) \nabla_z$ ...
-
[22]
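is given by $z_s = \frac{\sigma_s}{\sigma_t}[z_t - \alpha_t \hat{x}_\theta(z_t)] + \alpha_s \hat{x}_\theta(z_t)$ (20) for $s < t$. Taking the derivative of this expression with respect to $\lambda_s$, assuming again a variance preserving diffusion process, and using $\frac{d\alpha_\lambda}{d\lambda} = \frac{1}{2}\alpha_\lambda\sigma_\lambda^2$ and $\frac{d\sigma_\lambda}{d\lambda} = -\frac{1}{2}\sigma_\lambda\alpha_\lambda^2$, gives $\frac{dz_{\lambda_s}}{d\lambda_s} = \frac{d\sigma_{\lambda_s}}{d\lambda_s}\frac{1}{\sigma_t}[z_t - \alpha_t \hat{x}_\theta(z_t)] + \frac{d\alpha_{\lambda_s}}{d\lambda_s}\hat{x}_\theta(z_t)$ (21) $= -\frac{1}{2}\alpha_s^2\frac{\sigma_s}{\sigma_t}[z_t - \alpha_t \hat{x}_\theta(z_t)] + \frac{1}{2}\alpha_s\sigma_s^2$ ...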
-
[23]
Figure 5: Visualization of reparameterizing the diffusion process in terms of $\phi$ and $v_\phi$. E. Settings used in experiments. Our model architectures closely follow those described by Dhariwal & Nichol (2021). For 64 × 64 ImageNet we use their model exactly, with 192 channels at the highest resolution. All other models are slight variations with different hyperp...
-
[24]
We use single-headed attention, and only apply this at the 16 × 16 and 8 × 8 resolutions
At each resolution we apply 3 residual blocks, like described by Dhariwal & Nichol (2021). We use single-headed attention, and only apply this at the 16 × 16 and 8 × 8 resolutions. We use dropout of 0.2 when training the original model. No dropout is used during distillation. For LSUN we use a model similar to that for ImageNet, but with a reduced number ...
-
[25]
We clip the norm of gradients to a global norm of 1 before calculating parameter updates
with a constant of 0.001. We clip the norm of gradients to a global norm of 1 before calculating parameter updates. For CIFAR-10 we train for 800k parameter updates, for ImageNet we use 550k updates, and for LSUN we use 400k updates. During distillation we train for 50k updates per iteration, except for the distillation to 2 and 1 sampling steps, for whic...
-
[26]
[Plot: FID against sampling steps for 64x64 ImageNet; series: Distilled DDIM, Distilled Stochastic, Undistilled Stochastic.] Figure 6: FID of generated samples from distilled and undistilled models, using DDIM or stochastic sampling. For the stochastic sampling results we present the best FID obtained by a grid-search over 11 possible noise levels, spa...
-
[27]
forms a non-Gaussian distribution that falls outside the family of Gaussian distributions that can be modelled by a single DDPM student step: A multi-step stochastic DDPM sampler can thus not be distilled into a few-step sampler without some loss in fidelity. This is in contrast with the deterministic DDIM sampler: here both the two-step DDIM teacher upd...
-
[28]
For each schedule we selected the optimal learning rate from [5e−5, 1e−4, 2e−4, 3e−4]
All reported numbers are averages over 4 random seeds. For each schedule we selected the optimal learning rate from [5e−5, 1e−4, 2e−4, 3e−4]. [Plot: FID against sampling steps; panels: 64x64 ImageNet, 128x128 LSUN Bedrooms; series: 50k updates, 10k updates.] ...