Progressive Distillation for Fast Sampling of Diffusion Models

Jonathan Ho; Tim Salimans

arXiv: 2202.00512 · v2 · submitted 2022-02-01 · cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-11 09:31 UTC · model grok-4.3

keywords: diffusion models, progressive distillation, fast sampling, image generation, generative modeling, FID score, CIFAR-10, few-step sampling

The pith

Progressive distillation reduces diffusion model sampling from thousands of steps to 4 while keeping high image quality on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to remove the main practical drawback of diffusion models—the need for hundreds or thousands of model evaluations to produce one sample—by combining two changes. First, it introduces new parameterizations that make the models more stable when run with very few steps. Second, it shows a distillation process that takes a trained many-step sampler and trains a new model to match its outputs using half as many steps, then repeats the halving until only 4 steps remain. A reader would care because the resulting models still reach low FID scores, for example 3.0 on CIFAR-10, and the entire sequence of distillations costs no more training time than the original model.

Core claim

Starting from a deterministic diffusion sampler that uses up to 8192 steps, the authors apply a repeated distillation procedure in which each new model is trained to reproduce the previous model's output distribution using half the number of steps; together with parameterizations that increase stability at low step counts, this yields usable models that generate samples in only 4 steps on CIFAR-10, ImageNet, and LSUN while preserving most of the original perceptual quality.
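
The halving schedule implied by these two step counts can be enumerated directly. The sketch below is illustrative only: the function name `halving_schedule` is hypothetical, and it assumes the procedure goes from 8192 steps to 4 steps purely by repeated halving, as this summary describes.

```python
def halving_schedule(start_steps: int, final_steps: int) -> list[int]:
    """List the sampler step counts visited when halving repeatedly."""
    if final_steps < 1 or start_steps < final_steps:
        raise ValueError("need start_steps >= final_steps >= 1")
    schedule = [start_steps]
    while schedule[-1] > final_steps:
        if schedule[-1] % 2:
            raise ValueError("step count cannot be halved exactly")
        schedule.append(schedule[-1] // 2)
    if schedule[-1] != final_steps:
        raise ValueError("final_steps is not reachable by halving")
    return schedule


# Step counts quoted in the summary above.
schedule = halving_schedule(8192, 4)
rounds = len(schedule) - 1  # one distillation round per halving
print(schedule)
print(rounds)
```

Under that assumption the schedule contains 11 halving rounds, since 8192 / 4 = 2^11.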

What carries the argument

The progressive distillation procedure, which trains a student diffusion model to match a teacher sampler's multi-step trajectory using half the steps, combined with re-parameterizations that stabilize few-step sampling.
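
One round of this procedure can be sketched on scalars. Everything concrete below is an assumption made for illustration rather than something stated on this page: the cosine noise schedule, the deterministic DDIM update, the closed-form target, and the hypothetical `toy_teacher` denoiser (the exact posterior mean for zero-mean Gaussian data with standard deviation 0.5). The sketch computes the clean-sample target for which a single student step lands where two teacher steps do.

```python
import math


def alpha_sigma(t):
    """Assumed cosine schedule with alpha_t**2 + sigma_t**2 == 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim_step(z_t, x_hat, t, s):
    """Deterministic DDIM update from time t to an earlier time s (needs t > 0)."""
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    eps_hat = (z_t - alpha_t * x_hat) / sigma_t  # noise implied by the x estimate
    return alpha_s * x_hat + sigma_s * eps_hat


def toy_teacher(z_t, t):
    """Hypothetical stand-in for a trained denoiser: E[x | z_t] for x ~ N(0, 0.25)."""
    alpha_t, sigma_t = alpha_sigma(t)
    return alpha_t * z_t / (alpha_t ** 2 + 4.0 * sigma_t ** 2)


def distillation_target(z_t, t, student_steps, teacher=toy_teacher):
    """Run two teacher half-steps, then solve for the x the student should predict."""
    t_mid = t - 0.5 / student_steps
    t_end = t - 1.0 / student_steps
    z_mid = ddim_step(z_t, teacher(z_t, t), t, t_mid)
    z_end = ddim_step(z_mid, teacher(z_mid, t_mid), t_mid, t_end)
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_end, sigma_end = alpha_sigma(t_end)
    ratio = sigma_end / sigma_t
    # Invert a single DDIM step from t to t_end so that it reproduces z_end.
    x_target = (z_end - ratio * z_t) / (alpha_end - ratio * alpha_t)
    return x_target, z_end


z_t, t, student_steps = 0.8, 0.5, 4
x_target, z_teacher = distillation_target(z_t, t, student_steps)
z_student = ddim_step(z_t, x_target, t, t - 1.0 / student_steps)
print(abs(z_student - z_teacher) < 1e-9)
```

In training, the student's prediction at `z_t` would be regressed onto `x_target`; the loss weighting and the network itself are left out of this sketch.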

Load-bearing premise

That successive rounds of distillation do not accumulate enough error to degrade image quality and that the new parameterizations keep sampling stable when the step count is reduced across different image datasets.

What would settle it

A direct comparison on CIFAR-10 or ImageNet in which the 4-step distilled model produces visibly worse samples or a substantially higher FID than the original 8192-step sampler, or in which further distillation rounds cause a sudden quality collapse.

Original abstract

Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows how to distill diffusion models from thousands of steps down to four while keeping FID scores competitive on CIFAR-10 and similar benchmarks. They start with a high-step deterministic sampler and repeatedly train a new model to match it but with half the steps, applying the process several times until they reach four steps. Along the way they introduce new parameterizations that improve stability for low-step sampling. The headline result is an FID of 3.0 on CIFAR-10 at four steps, with comparable preservation of quality on ImageNet and LSUN. The full sequence of distillations takes no more wall-clock time than training the original model once.

What stands out is the iterative halving approach itself. Earlier distillation work tended to target a single large reduction in steps; here the progressive schedule lets them maintain quality across multiple halvings without a sharp drop at any stage. The efficiency claim is also useful because it removes the usual worry that extra training stages will dominate the compute budget.

The soft spots are limited but worth noting. The abstract gives clean headline numbers yet leaves out error bars, ablations on the new parameterizations, and checks on how much the outcome depends on the exact distillation hyperparameters or random seeds. If the full paper supplies those controls and shows the procedure is not overly brittle, the results will be more convincing. The work also stays within standard image benchmarks, so its behavior on other data types remains open.

This is aimed at anyone who trains or deploys diffusion models and needs faster sampling without retraining from scratch. A practitioner looking for a drop-in speed-up recipe will find the method straightforward to try once the details are in hand. It deserves a serious referee because the empirical gains are large enough and the procedure is simple enough that expert feedback on reproducibility and edge cases would be valuable to the community.

Referee Report

2 major / 2 minor

Summary. The paper claims that new parameterizations of diffusion models increase stability for few-step sampling, and that a progressive distillation procedure can iteratively halve the number of sampling steps (from up to 8192 down to 4) while preserving perceptual quality on image generation tasks. It reports concrete results such as an FID of 3.0 on CIFAR-10 with 4 steps, along with results on ImageNet and LSUN, and states that the full distillation procedure takes no more time than training the original model.

Significance. If the empirical results hold, the work is significant for addressing the slow sampling drawback of diffusion models, enabling fast generation competitive with alternatives like GANs while retaining quality and density estimation advantages. The progressive distillation approach combined with the new parameterizations provides a practical, efficient solution, and the manuscript supplies falsifiable benchmark outcomes across multiple standard datasets.

major comments (2)

§5 (Experimental results): The central claim that progressive distillation preserves perceptual quality down to 4 steps (e.g., CIFAR-10 FID of 3.0) is load-bearing, yet the reported benchmark numbers lack error bars, multiple random seed statistics, or ablations isolating the new parameterizations from the distillation procedure; this directly affects assessment of robustness against error accumulation.
§3.2 (New parameterizations): The claim that the introduced parameterizations reliably stabilize few-step sampling is central to enabling the progressive procedure, but the section provides no analysis or equations demonstrating their effect on sampling dynamics or variance reduction, relying only on end-to-end empirical outcomes.

minor comments (2)

[Abstract] The abstract and introduction could more explicitly state the exact sequence of distillation steps applied and the base model architectures used for each benchmark.
[§4] Notation for the teacher-student alignment in the distillation loss could be clarified with an additional equation showing how the student is trained to match the teacher's multi-step trajectory.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where we will strengthen the presentation of results and analysis.

Referee: §5 (Experimental results): The central claim that progressive distillation preserves perceptual quality down to 4 steps (e.g., CIFAR-10 FID of 3.0) is load-bearing, yet the reported benchmark numbers lack error bars, multiple random seed statistics, or ablations isolating the new parameterizations from the distillation procedure; this directly affects assessment of robustness against error accumulation.

Authors: We acknowledge that error bars, multi-seed statistics, and explicit ablations would strengthen the assessment of robustness. The manuscript reports results from single runs with fixed seeds for reproducibility, but demonstrates consistency by applying the same progressive procedure across CIFAR-10, ImageNet, and LSUN while preserving quality from 8192 steps down to 4. The load-bearing claim is further supported by the fact that each halving step maintains perceptual quality without retraining from scratch. To address the concern directly, we will revise §5 to include error bars from additional runs (where feasible given compute), a note on seed consistency, and a targeted ablation isolating the new parameterizations' contribution from the distillation steps.

Revision: yes

Referee: §3.2 (New parameterizations): The claim that the introduced parameterizations reliably stabilize few-step sampling is central to enabling the progressive procedure, but the section provides no analysis or equations demonstrating their effect on sampling dynamics or variance reduction, relying only on end-to-end empirical outcomes.

Authors: Section 3.2 introduces the new parameterizations (including the velocity parameterization) as direct modifications to the standard diffusion model output that reduce sensitivity to accumulated errors in few-step regimes. The section provides the explicit functional forms and motivates them via their effect on the reverse-process update. While the primary validation is through the end-to-end progressive distillation results, we agree that additional equations would clarify the variance-reduction mechanism. We will revise §3.2 to include the sampling update equations under these parameterizations and a short derivation showing how they lower the effective variance of the predicted clean image relative to noise prediction, thereby enabling stable halving.

Revision: yes
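
The relationship between the velocity parameterization and the more familiar noise and clean-image predictions can be made concrete. The sketch below assumes a variance-preserving schedule with alpha_t^2 + sigma_t^2 = 1 (here a cosine schedule) and the definition v = alpha_t * eps - sigma_t * x; neither is spelled out on this page, so both should be read as assumptions, and all function names are hypothetical. It shows that x and eps are recovered from v without any division, whereas recovering x from a noise prediction divides by alpha_t, which approaches zero at the highest noise levels.

```python
import math


def alpha_sigma(t):
    """Assumed cosine schedule with alpha_t**2 + sigma_t**2 == 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def v_target(x, eps, t):
    """Velocity target under the assumed definition v = alpha * eps - sigma * x."""
    alpha, sigma = alpha_sigma(t)
    return alpha * eps - sigma * x


def x_from_v(z, v, t):
    """Clean-sample estimate from a velocity prediction (no division needed)."""
    alpha, sigma = alpha_sigma(t)
    return alpha * z - sigma * v


def eps_from_v(z, v, t):
    """Noise estimate from a velocity prediction."""
    alpha, sigma = alpha_sigma(t)
    return sigma * z + alpha * v


def x_from_eps(z, eps_hat, t):
    """Clean-sample estimate from a noise prediction (divides by alpha_t)."""
    alpha, sigma = alpha_sigma(t)
    return (z - sigma * eps_hat) / alpha


# Round trip at a moderate noise level.
x, eps, t = 0.3, -1.2, 0.7
alpha, sigma = alpha_sigma(t)
z = alpha * x + sigma * eps
v = v_target(x, eps, t)
x_rec, eps_rec = x_from_v(z, v, t), eps_from_v(z, v, t)

# How much a unit error in the network output moves the x estimate near t = 1.
alpha_hi, sigma_hi = alpha_sigma(0.99)
gain_eps = sigma_hi / alpha_hi  # |d x_hat / d eps_hat| for noise prediction
gain_v = sigma_hi               # |d x_hat / d v_hat| for velocity prediction
print(gain_eps > gain_v)
```

The two gain values illustrate the sensitivity argument in the exchange above: with noise prediction, an output error is amplified by sigma_t / alpha_t when mapped to a clean-image estimate, while with velocity prediction the factor is sigma_t, which never exceeds 1.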

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical training procedure (progressive distillation) and new parameterizations for diffusion models, with all load-bearing claims consisting of experimental outcomes measured on held-out benchmarks such as CIFAR-10 FID scores. No equations, predictions, or first-principles derivations reduce outputs to inputs by construction, and no self-citations serve as the sole justification for the central method or results. The procedure is self-contained against external validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard diffusion model assumptions plus the empirical claim that distillation can be applied progressively without quality collapse. It introduces no new physical entities and no unstated mathematical axioms beyond those typical of ML training.

free parameters (1)

distillation hyperparameters
Choices such as learning rate and step-halving schedule are tuned to achieve reported results.

axioms (1)

Domain assumption: diffusion models admit parameterizations that remain stable under few-step sampling.
Invoked as the first contribution enabling distillation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps"

IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
cs.CV 2026-05 unverdicted novelty 8.0

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
Query Lower Bounds for Diffusion Sampling
cs.LG 2026-04 unverdicted novelty 8.0

Diffusion sampling from d-dimensional distributions requires at least ~sqrt(d) adaptive score queries when score estimates have polynomial accuracy.
DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adapt...
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
cs.CV 2026-05 unverdicted novelty 7.0

VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
Generative Pseudo-Force Fields for Molecular Generation
cs.LG 2026-05 unverdicted novelty 7.0

Proposes generative pseudo-force fields trained on quadratic pseudo-potentials from noisy equilibria as a time-step-agnostic diffusion variant for efficient molecular conformation generation with high validity on QM9.
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
StreamingEffect: Real-Time Human-Centric Video Effect Generation
cs.CV 2026-05 unverdicted novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Training-Free Generative Sampling via Moment-Matched Score Smoothing
stat.ML 2026-05 unverdicted novelty 7.0

MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.
Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation
cs.CV 2026-05 unverdicted novelty 7.0

A hypernetwork maps style motion embeddings to LoRA updates that stylize text-driven motion diffusion models with improved generalization to unseen styles via contrastive structuring of the style space.
One-Step Generative Modeling via Wasserstein Gradient Flows
cs.LG 2026-05 conditional novelty 7.0

W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...
Muninn: Your Trajectory Diffusion Model But Faster
cs.RO 2026-05 unverdicted novelty 7.0

Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
cs.HC 2026-05 unverdicted novelty 7.0

HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
cs.CV 2026-05 unverdicted novelty 7.0

LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
PODiff: Latent Diffusion in Proper Orthogonal Decomposition Space for Scientific Super-Resolution
cs.LG 2026-05 unverdicted novelty 7.0

PODiff performs conditional diffusion in a fixed, variance-ordered POD latent space to enable efficient probabilistic super-resolution of high-dimensional scientific fields with lower memory and better-calibrated unce...
Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
cs.CV 2026-05 unverdicted novelty 7.0

SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
cs.CV 2026-04 unverdicted novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
cs.CV 2026-04 unverdicted novelty 7.0

Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement
cs.CV 2026-04 unverdicted novelty 7.0

A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
cs.CV 2026-04 unverdicted novelty 7.0

Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
cs.CV 2026-04 conditional novelty 7.0

1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
cs.CV 2026-03 unverdicted novelty 7.0

Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL 2026-02 unverdicted novelty 7.0

Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
cs.CV 2026-02 unverdicted novelty 7.0

DisCa replaces heuristic feature caching with a lightweight learnable neural predictor compatible with distillation, achieving 11.8× acceleration on video diffusion transformers with preserved generation quality.
Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
cs.LG 2026-02 unverdicted novelty 7.0

Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
cs.CV 2025-12 conditional novelty 7.0

Stream-DiffVSR enables practical low-latency video super-resolution by combining a four-step distilled denoiser, auto-regressive temporal guidance, and a temporal processor in a strictly causal pipeline.
Large Video Planner Enables Generalizable Robot Control
cs.RO 2025-12 conditional novelty 7.0

A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
Training Agents Inside of Scalable World Models
cs.AI 2025-09 conditional novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
StereoFoley: Object-Aware Stereo Audio Generation from Video
cs.SD 2025-09 conditional novelty 7.0

StereoFoley is an end-to-end video-to-stereo-audio framework that uses a base generative model fine-tuned on synthetic object-tracked data with panning and distance controls to achieve object-aware spatial sound.
Lipschitz-Guided Design of Interpolation Schedules in Generative Models
stat.ML 2025-09 unverdicted novelty 7.0

Minimizing averaged squared Lipschitzness of the drift produces interpolation schedules that improve numerical accuracy and mitigate mode collapse in generative models, with closed-form optima for Gaussians and valida...
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
cs.AI 2025-07 unverdicted novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
Training-Free Inference for High-Resolution Sinogram Completion
cs.CV 2025-06 unverdicted novelty 7.0

HRSino is a training-free adaptive diffusion inference approach for high-resolution sinogram completion that reduces peak memory by up to 30.81% and inference time by up to 17.58% while maintaining accuracy.
History-Guided Video Diffusion
cs.LG 2025-02 unverdicted novelty 7.0

DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
One Step Diffusion via Shortcut Models
cs.LG 2024-10 conditional novelty 7.0

Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
Diffusion Models Are Real-Time Game Engines
cs.LG 2024-08 conditional novelty 7.0

A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.
Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
cs.CV 2023-10 unverdicted novelty 7.0

Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
cs.LG 2022-08 unverdicted novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space
cs.CV 2026-05 conditional novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
Variance Reduction for Expectations with Diffusion Teachers
cs.LG 2026-05 unverdicted novelty 6.0

CARV amortizes upstream diffusion teacher costs over noise resamples with timestep importance sampling and stratified-inverse-CDF sampling, delivering 2-3x effective compute gains in text-to-3D experiments and order-o...
Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment
cs.LG 2026-05 unverdicted novelty 6.0

REPA-P aligns intermediate representations in diffusion models with physical states using first-principles PDE residuals to accelerate convergence and boost out-of-distribution robustness on PDE tasks.
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

LIFT and PLACE enable stable knowledge distillation for extremely lightweight diffusion models by decomposing the task into coarse alignment followed by fine refinement with piecewise local adaptive guidance.
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

LIFT and PLACE enable stable training of extremely compressed diffusion models by breaking distillation into coarse linear alignment followed by local adaptive refinement.
WavFlow: Audio Generation in Waveform Space
cs.SD 2026-05 conditional novelty 6.0

WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
DCFold: Efficient Protein Structure Generation with Single Forward Pass
cs.LG 2026-05 unverdicted novelty 6.0

DCFold achieves AlphaFold3-level protein structure prediction accuracy in a single forward pass using Dual Consistency training and a Temporal Geodesic Matching scheduler, delivering 15x inference acceleration.
Taming Audio VAEs via Target-KL Regularization
cs.SD 2026-05 unverdicted novelty 6.0

The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 6.0

DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
cs.CV 2026-05 unverdicted novelty 6.0

ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency p...
FLASH: Efficient Visuomotor Policy via Sparse Sampling
cs.RO 2026-05 unverdicted novelty 6.0

FLASH Policy uses sparse Legendre polynomial trajectory fitting and history-anchored flow matching to enable single-step inference for visuomotor control, reporting 31.4 ms per-episode latency and >=92% success on fiv...
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

CLVR couples verified logical planning with pixel diffusion, uses proxy reinforcement learning on distilled histories, and merges weights to cut inference to 4 NFEs while outperforming open-source T2I models on comple...
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
cs.LG 2026-05 conditional novelty 6.0

ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.
Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies
physics.ao-ph 2026-05 unverdicted novelty 6.0

A multivariate diffusion generative downscaling method preserves inter-variable correlations in climate data under large resolution increases, enabling more accurate compound risk assessment.
FlashMol: High-Quality Molecule Generation in as Few as Four Steps
cs.LG 2026-05 unverdicted novelty 6.0

FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...
MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
cs.CV 2026-04 unverdicted novelty 6.0

MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG 2026-04 unverdicted novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis
cs.CV 2026-04 unverdicted novelty 6.0

WFM achieves near-diffusion quality for all four BraTS MRI modalities with one 82M model in 1-2 steps by flowing from the mean of conditioning modalities in wavelet space, running 250-1000x faster.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 111 Pith papers · 3 internal anchors

[1]

Structured denoising diffusion models in discrete state-spaces

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. CoRR, abs/2107.03006,

work page arXiv
[2]

Learning gradient fields for shape generation

Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. arXiv preprint arXiv:2008.06520,

work page arXiv 2008
[3]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233,

work page internal anchor Pith review arXiv
[4]

FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models

Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367,

work page Pith review arXiv
[5]

Cascaded diffusion models for high fidelity image generation

Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282,

work page arXiv
[6]

Argmax flows and multinomial diffusion: Learning categorical distributions

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. arXiv preprint arXiv:2102.05379,

work page arXiv
[7]

Gotta go fast when generating data with score-based models

Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080,

work page arXiv
[8]

Variational diffusion models

Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630,

work page arXiv
[9]

On fast sampling of diffusion probabilistic models

Published as a conference paper at ICLR 2022. Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2106.00132,

work page arXiv 2022
[10]

Bilateral denoising diffusion models

Max WY Lam, Jun Wang, Rongjie Huang, Dan Su, and Dong Yu. Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514,

work page arXiv
[11]

SRDiff: Single image super-resolution with diffusion probabilistic models

Haoying Li, Yifan Yang, Meng Chang, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. SRDiff: Single image super-resolution with diffusion probabilistic models. arXiv preprint arXiv:2104.14951,

work page arXiv
[12]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388,

work page internal anchor Pith review arXiv
[14]

Non Gaussian denoising diffusion models

Eliya Nachmani, Robin San Roman, and Lior Wolf. Non Gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582,

work page arXiv
[15]

Fast generation for convolutional autoregressive models

Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A Hasegawa-Johnson, Roy H Campbell, and Thomas S Huang. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001,

work page arXiv
[16]

Image super-resolution via iterative refinement

Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636,

work page arXiv
[17]

Noise estimation for generative diffusion models

Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600,

work page arXiv
[18]

Maximum likelihood training of score-based diffusion models

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. arXiv e-prints, pp. arXiv–2101, 2021b. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference ...

work page arXiv 2004
[19]

Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit

Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019a. Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference ...

work page arXiv 1905
[20]

Learning to efficiently sample from diffusion probabilistic models

Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802,

work page arXiv
[21]

A PROBABILITY FLOW ODE IN TERMS OF LOG-SNR. Song et al. (2021c) formulate the forward diffusion process in terms of an SDE of the form $dz = f(z,t)\,dt + g(t)\,dW$ (10), and show that samples from this diffusion process can be generated by solving the associated probability flow ODE: $dz = [f(z,t) - \tfrac{1}{2} g^2(t) \nabla_z$ ...

work page 2022
[22]

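is given by $z_s = \frac{\sigma_s}{\sigma_t}\,[z_t - \alpha_t \hat{x}_\theta(z_t)] + \alpha_s \hat{x}_\theta(z_t)$ (20) for $s < t$. Taking the derivative of this expression with respect to $\lambda_s$, assuming again a variance preserving diffusion process, and using $\frac{d\alpha_\lambda}{d\lambda} = \frac{1}{2}\alpha_\lambda \sigma_\lambda^2$ and $\frac{d\sigma_\lambda}{d\lambda} = -\frac{1}{2}\sigma_\lambda \alpha_\lambda^2$, gives $\frac{dz_{\lambda_s}}{d\lambda_s} = \frac{d\sigma_{\lambda_s}}{d\lambda_s}\,\frac{1}{\sigma_t}\,[z_t - \alpha_t \hat{x}_\theta(z_t)] + \frac{d\alpha_{\lambda_s}}{d\lambda_s}\,\hat{x}_\theta(z_t)$ (21) $= -\frac{1}{2}\alpha_s^2\,\frac{\sigma_s}{\sigma_t}\,[z_t - \alpha_t \hat{x}_\theta(z_t)] + \frac{1}{2}\alpha_s \sigma_s^2$ ...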

work page 2022
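The update in Eq. (20) and the derivative identities quoted above can be checked numerically. The following is a minimal sketch in plain Python on scalars, not the authors' code. It assumes the usual variance-preserving convention alpha^2 = sigmoid(lambda) and sigma^2 = 1 - alpha^2, and it assumes the second term of Eq. (21), which is cut off in the excerpt, multiplies the prediction x_hat as the first line of Eq. (21) implies. The function names `alpha_sigma`, `ddim_step`, `dz_dlambda` and `finite_difference` are hypothetical.

```python
import math


def alpha_sigma(log_snr):
    """Variance-preserving coefficients from log-SNR (assumed convention:
    alpha^2 = sigmoid(log_snr), sigma^2 = 1 - alpha^2)."""
    alpha_sq = 1.0 / (1.0 + math.exp(-log_snr))
    return math.sqrt(alpha_sq), math.sqrt(1.0 - alpha_sq)


def ddim_step(z_t, x_hat, alpha_t, sigma_t, alpha_s, sigma_s):
    """Eq. (20): move z_t to noise level s, holding the prediction x_hat fixed."""
    return (sigma_s / sigma_t) * (z_t - alpha_t * x_hat) + alpha_s * x_hat


def dz_dlambda(z_t, x_hat, alpha_t, sigma_t, alpha_s, sigma_s):
    """Closed-form derivative of Eq. (20) w.r.t. lambda_s, using
    d(alpha)/d(lambda) = alpha*sigma^2/2 and d(sigma)/d(lambda) = -sigma*alpha^2/2."""
    noise_term = -0.5 * alpha_s**2 * (sigma_s / sigma_t) * (z_t - alpha_t * x_hat)
    signal_term = 0.5 * alpha_s * sigma_s**2 * x_hat
    return noise_term + signal_term


def finite_difference(z_t, x_hat, lam_t, lam_s, h=1e-5):
    """Central finite difference of ddim_step with respect to lambda_s."""
    alpha_t, sigma_t = alpha_sigma(lam_t)
    hi = ddim_step(z_t, x_hat, alpha_t, sigma_t, *alpha_sigma(lam_s + h))
    lo = ddim_step(z_t, x_hat, alpha_t, sigma_t, *alpha_sigma(lam_s - h))
    return (hi - lo) / (2.0 * h)


if __name__ == "__main__":
    z_t, x_hat, lam_t, lam_s = 0.3, 1.0, 0.0, 1.0  # arbitrary scalar test values
    a_t, s_t = alpha_sigma(lam_t)
    a_s, s_s = alpha_sigma(lam_s)
    print(dz_dlambda(z_t, x_hat, a_t, s_t, a_s, s_s))
    print(finite_difference(z_t, x_hat, lam_t, lam_s))
```

The two printed values should agree to several decimal places, which is a quick sanity check that the closed form is consistent with Eq. (20) under the assumed parameterization.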
[23]

E SETTINGS USED IN EXPERIMENTS Our model architectures closely follow those described by Dhariwal & Nichol (2021)

Figure 5: Visualization of reparameterizing the diffusion process in terms of φ and v_φ. E SETTINGS USED IN EXPERIMENTS Our model architectures closely follow those described by Dhariwal & Nichol (2021). For 64 × 64 ImageNet we use their model exactly, with 192 channels at the highest resolution. All other models are slight variations with different hyperp...

work page 2021
[24]

We use single-headed attention, and only apply this at the 16 × 16 and 8 × 8 resolutions

At each resolution we apply 3 residual blocks, as described by Dhariwal & Nichol (2021). We use single-headed attention, and only apply this at the 16 × 16 and 8 × 8 resolutions. We use dropout of 0.2 when training the original model. No dropout is used during distillation. For LSUN we use a model similar to that for ImageNet, but with a reduced number ...

work page 2021
[25]

We clip the norm of gradients to a global norm of 1 before calculating parameter updates

with a constant of 0.001. We clip the norm of gradients to a global norm of 1 before calculating parameter updates. For CIFAR-10 we train for 800k parameter updates, for ImageNet we use 550k updates, and for LSUN we use 400k updates. During distillation we train for 50k updates per iteration, except for the distillation to 2 and 1 sampling steps, for whic...

work page 2022
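Clipping gradients to a global norm of 1, as described above, rescales all gradients jointly rather than clipping each one separately. The sketch below illustrates the idea in plain Python, assuming gradients are given as a list of flat lists. The function name `clip_by_global_norm` and this data layout are illustrative assumptions, not the authors' training code.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        # Already within the limit: return copies unchanged.
        return [list(grad) for grad in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm


if __name__ == "__main__":
    grads = [[3.0, 0.0], [4.0]]  # toy gradients with global norm 5
    clipped, norm_before = clip_by_global_norm(grads)
    print(norm_before, clipped)
```

Because one scale factor is applied to every parameter group, the direction of the overall update is preserved and only its length is capped.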
[26]

[Figure 6 plot: FID versus sampling steps on 64x64 ImageNet; series: Distilled DDIM, Distilled Stochastic, Undistilled Stochastic.] Figure 6: FID of generated samples from distilled and undistilled models, using DDIM or stochastic sampling. For the stochastic sampling results we present the best FID obtained by a grid-search over 11 possible noise levels, spa...

work page 2020
[27]

forms a non-Gaussian distribution that falls outside the family of Gaussian distributions that can be modelled by a single DDPM student step: A multi-step stochastic DDPM sampler can thus not be distilled into a few-step sampler without some loss in fidelity. This is in contrast with the deterministic DDIM sampler: here both the two-step DDIM teacher upd...

work page 2021
[28]

For each schedule we selected the optimal learning rate from [5e−5, 1e−4, 2e−4, 3e−4]

All reported numbers are averages over 4 random seeds. For each schedule we selected the optimal learning rate from [5e−5, 1e−4, 2e−4, 3e−4]. [Figure plots: FID versus sampling steps. Panels: 64x64 ImageNet (50k updates, 10k updates); 128x128 LSUN Bedrooms (50k upda...]

work page 2022
