arxiv: 2310.04378 · v1 · submitted 2023-10-06 · 💻 cs.CV · cs.LG

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo , Yiqin Tan , Longbo Huang , Jian Li , Hang Zhao

Pith reviewed 2026-05-13 04:06 UTC · model grok-4.3

keywords: latent consistency models · few-step inference · text-to-image generation · diffusion models · image synthesis · probability flow ODE · distillation

The pith

Latent Consistency Models enable high-resolution image synthesis in 2 to 4 inference steps by directly predicting ODE solutions in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Latent Consistency Models that treat the guided reverse diffusion process as solving an augmented probability flow ODE and directly predict its solution in latent space. This removes the need for dozens of iterative steps while preserving high fidelity when distilled from pre-trained classifier-free guided diffusion models. A 768 by 768 LCM can be trained in 32 A100 GPU hours and supports rapid sampling. The authors also present Latent Consistency Fine-tuning to adapt the models to custom image datasets. If the approach holds, text-to-image generation becomes far less computationally demanding and more practical for everyday use.

Core claim

Latent Consistency Models are designed to directly predict the solution of the augmented probability flow ODE in latent space. After efficient distillation from a classifier-free guided model, this removes the need for numerous iterations and allows rapid, high-fidelity sampling on any pre-trained LDM, including Stable Diffusion.

What carries the argument

Latent Consistency Model that directly predicts the solution of the augmented probability flow ODE in latent space.
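
The few-step procedure this implies can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cosine noise schedule, the function names, and the constant `TARGET` are assumptions made for the example, and `toy_consistency_fn` stands in for the trained network (it is exact only for a dataset whose every latent equals `TARGET`).

```python
import math
import random

TARGET = 0.5  # hypothetical: every "clean latent" in the toy dataset equals this value


def alpha_sigma(t):
    """Assumed variance-preserving schedule on t in [0, 1] (alpha^2 + sigma^2 = 1)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def toy_consistency_fn(z, t):
    """Stand-in for the network: maps any noisy latent to the clean latent."""
    return [TARGET] * len(z)


def multistep_consistency_sample(f, timesteps, dim, rng):
    """Alternate 'predict clean latent' and 're-noise to a lower noise level'.

    f(z, t) returns an estimate of the clean latent (one network call).
    timesteps is a decreasing list of noise levels, one network call each.
    """
    # Start from pure Gaussian noise at the largest noise level.
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    z0 = f(z, timesteps[0])
    for t in timesteps[1:]:
        alpha, sigma = alpha_sigma(t)
        # Re-noise the current clean estimate to level t, then predict again.
        z = [alpha * x + sigma * rng.gauss(0.0, 1.0) for x in z0]
        z0 = f(z, t)
    return z0


sample = multistep_consistency_sample(
    toy_consistency_fn, [1.0, 0.75, 0.5, 0.25], dim=3, rng=random.Random(0)
)
```

Each entry in `timesteps` costs one call to `f`, so a four-entry list corresponds to the 4-step regime and a single entry to one-step generation.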

If this is right

High-quality 768 by 768 images can be produced with only 2 to 4 sampling steps.
Training a capable LCM requires just 32 A100 GPU hours.
State-of-the-art text-to-image results are achievable under few-step inference constraints.
Latent Consistency Fine-tuning adapts the models to specialized image collections with low additional cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interactive or real-time image generation tools could run on consumer hardware without server-scale resources.
The same distillation pattern might apply to other latent-space generative tasks such as video or 3D synthesis.
Widespread adoption would reduce total energy use for large-scale image generation services.

Load-bearing premise

The consistency property learned via distillation in latent space will preserve high visual fidelity and text alignment across diverse prompts without iterative refinement.

What would settle it

A controlled test on the LAION-5B-Aesthetics dataset would settle it: if 2-step LCM outputs showed substantially higher (worse) FID scores, or lower human preference ratings for quality and prompt alignment, than 50-step baseline LDM outputs on the same prompts, the claim would be falsified.
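
To make the FID side of such a test concrete, here is a simplified sketch of the Fréchet distance between two sets of feature vectors. It assumes diagonal covariances so that it runs on the standard library alone; real FID uses full covariance matrices of Inception features, so the numbers from this sketch are not comparable to published FID values. The function names are illustrative.

```python
import math


def mean_and_std(features):
    """Per-dimension mean and population standard deviation of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mu = [sum(f[d] for f in features) / n for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mu[d]) ** 2 for f in features) / n)
        for d in range(dim)
    ]
    return mu, std


def frechet_distance_diag(mu1, std1, mu2, std2):
    """Squared Frechet distance between two Gaussians with diagonal covariance.

    With diagonal covariances the trace term reduces to the sum of squared
    differences of the per-dimension standard deviations.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((a - b) ** 2 for a, b in zip(std1, std2))
    return mean_term + cov_term
```

In the proposed test, one set of features would come from 2-step LCM images and the other from reference images, with the 50-step baseline scored the same way; lower is better.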

Original abstract

Latent Diffusion Models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (Song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (Rombach et al.). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LCMs adapt consistency distillation to pre-trained latent diffusion models for 2-4 step sampling at low training cost, but the fidelity and alignment claims hinge on unshown empirical checks.

The core advance here is distilling a consistency function directly in the latent space of models like Stable Diffusion so that a single forward pass solves the guided PF-ODE trajectory. They frame the classifier-free guided reverse process as an augmented ODE and train the network to map any point on the trajectory to the clean endpoint. The training budget is modest—32 A100 hours for a 768x768 model—and they add LCF as a fine-tuning step for custom datasets. This is a straightforward engineering extension of Song et al.'s consistency models that avoids retraining the base LDM from scratch and delivers usable few-step inference out of the box. The LAION-5B-Aesthetics results are presented as state-of-the-art for the few-step regime, which would matter for anyone who needs faster sampling without sacrificing too much quality. The method is reproducible in principle because it starts from public checkpoints and specifies the distillation procedure. The soft spot is exactly the one the stress-test note flags: once you fold guidance into the ODE and enforce consistency only along the distilled trajectories, there is no guarantee that the mapping generalizes to out-of-distribution prompts or high guidance scales the way iterative sampling does. Artifacts or text drift can appear because the model no longer gets the benefit of error correction over many steps. The abstract gives no numbers, ablations, or direct comparisons to other accelerated baselines, so the SOTA claim is hard to evaluate from the summary alone. If the full paper includes those controls and human preference data, the practical contribution is real. This is for people building or deploying text-to-image systems who care about latency more than squeezing out the last bit of FID. It is worth sending to review because the adaptation is clean and the compute numbers are credible, even if the experiments will need tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Latent Consistency Models (LCMs) obtained by distilling a consistency function from pre-trained classifier-free guided latent diffusion models (LDMs) such as Stable Diffusion. By framing the guided reverse process as an augmented probability-flow ODE in latent space, LCMs are designed to map any latent point directly to the clean image in 2-4 steps. The authors report that a 768×768 LCM can be trained in 32 A100 GPU hours and claim state-of-the-art text-to-image performance on the LAION-5B-Aesthetics dataset; they additionally propose Latent Consistency Fine-tuning (LCF) for dataset-specific adaptation.

Significance. If the empirical results are robust, the work would be significant: it reduces the inference cost of high-resolution diffusion models by roughly an order of magnitude while preserving quality, directly addressing the primary practical limitation of LDMs. The low reported training budget and the introduction of a fine-tuning procedure further increase the potential impact for both research and deployment.
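
A back-of-the-envelope count of denoiser evaluations shows where an order-of-magnitude figure could come from. The sketch assumes that a classifier-free guided baseline runs two denoiser passes per step (conditional and unconditional, often batched together in practice) and that the distilled model takes the guidance scale as an input and so needs one pass per step. It ignores the text encoder and the latent decoder, and the function name is illustrative.

```python
def network_evaluations(steps, evals_per_step):
    """Total denoiser forward passes for one image."""
    return steps * evals_per_step


# 50-step guided baseline: conditional + unconditional pass at every step (assumed).
baseline_evals = network_evaluations(steps=50, evals_per_step=2)

# Few-step distilled sampler: one pass per step (assumed).
lcm_evals_4 = network_evaluations(steps=4, evals_per_step=1)
lcm_evals_2 = network_evaluations(steps=2, evals_per_step=1)

ratio_4_step = baseline_evals / lcm_evals_4
ratio_2_step = baseline_evals / lcm_evals_2
```

Under these assumptions the reduction in denoiser calls is 25x at 4 steps and 50x at 2 steps; counting steps alone rather than passes halves both ratios.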

major comments (3)

[Experiments] Experiments section: the central SOTA claim is supported only by qualitative examples and aggregate statements; no quantitative tables compare FID, CLIP score, or human preference against strong few-step baselines (e.g., distilled Consistency Models, progressive distillation, or SD with 4-step DPM-Solver) on the same LAION-5B-Aesthetics split. Without these numbers the performance assertion cannot be verified.
[Method] Method (distillation procedure): the consistency loss is applied after folding classifier-free guidance into the PF-ODE, yet no ablation quantifies how well the learned consistency function preserves alignment at guidance scales >7.5 or on out-of-distribution prompts. This directly tests the skeptic concern that iterative error correction is being replaced by an unverified generalization assumption.
[Experiments] Table 1 / Figure 4 (if present): the reported 32 A100-hour training budget is given without breakdown of batch size, number of distillation iterations, or teacher sampling cost; this makes it impossible to assess whether the efficiency claim is reproducible or comparable to prior distillation work.
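
For readers comparing guidance scales across codebases, the sketch below shows two common ways of writing the classifier-free guided noise prediction. Which convention a given paper or library uses is an assumption to check against its own definitions; the function names are illustrative. The two forms agree when `scale = 1 + omega`, so a scale of 7.5 in the second form corresponds to `omega = 6.5` in the first.

```python
import math


def guided_eps_omega(eps_cond, eps_uncond, omega):
    """Guidance written as (1 + omega) * cond - omega * uncond."""
    return [(1.0 + omega) * c - omega * u for c, u in zip(eps_cond, eps_uncond)]


def guided_eps_scale(eps_cond, eps_uncond, scale):
    """Guidance written as uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [1.0, 2.0]    # toy conditional noise prediction
eps_uncond = [0.5, 1.0]  # toy unconditional noise prediction

a = guided_eps_omega(eps_cond, eps_uncond, omega=6.5)
b = guided_eps_scale(eps_cond, eps_uncond, scale=7.5)
same = all(math.isclose(x, y) for x, y in zip(a, b))
```

An ablation over guidance scales, as the second major comment requests, would need to state which of these conventions its numbers refer to.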

minor comments (2)

[Method] Notation: the augmented PF-ODE is introduced without an explicit equation number; adding Eq. (X) for the guided velocity field would clarify how the consistency target is constructed.
The project page link is given but the manuscript does not state whether code or checkpoints will be released, which is standard for reproducibility in this area.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

Referee: [Experiments] Experiments section: the central SOTA claim is supported only by qualitative examples and aggregate statements; no quantitative tables compare FID, CLIP score, or human preference against strong few-step baselines (e.g., distilled Consistency Models, progressive distillation, or SD with 4-step DPM-Solver) on the same LAION-5B-Aesthetics split. Without these numbers the performance assertion cannot be verified.

Authors: We agree that quantitative metrics are essential to rigorously support the state-of-the-art claim. In the revised manuscript we will add a dedicated table reporting FID and CLIP scores for LCMs against the suggested few-step baselines (distilled Consistency Models, progressive distillation, and 4-step DPM-Solver) on the identical LAION-5B-Aesthetics evaluation split. These metrics have been computed and will be included to allow direct verification of the performance assertions.
revision: yes

Referee: [Method] Method (distillation procedure): the consistency loss is applied after folding classifier-free guidance into the PF-ODE, yet no ablation quantifies how well the learned consistency function preserves alignment at guidance scales >7.5 or on out-of-distribution prompts. This directly tests the skeptic concern that iterative error correction is being replaced by an unverified generalization assumption.

Authors: This is a fair point about potential limitations in generalization. We will add a new ablation subsection (and corresponding figure) that systematically evaluates text-image alignment at guidance scales from 5 to 15 and on a set of out-of-distribution prompts. The results will quantify how well the distilled consistency function maintains prompt adherence without relying on iterative refinement.
revision: yes

Referee: [Experiments] Table 1 / Figure 4 (if present): the reported 32 A100-hour training budget is given without breakdown of batch size, number of distillation iterations, or teacher sampling cost; this makes it impossible to assess whether the efficiency claim is reproducible or comparable to prior distillation work.

Authors: We concur that a more granular breakdown is required for reproducibility. The revised manuscript will expand the training-details paragraph (and update Table 1) to explicitly state the batch size, total number of distillation iterations, and the per-iteration teacher sampling cost, enabling direct comparison with prior distillation methods.
revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained via external distillation

The paper's central construction extends Consistency Models (Song et al.) to latent space by distilling a consistency function from an external pre-trained LDM (Rombach et al.). The ODE prediction property is enforced through a separate distillation loss on the pre-trained model outputs, not by redefining the target as the input. No equations reduce the learned mapping to a fitted parameter or self-referential definition. Citations are to independent prior work with no author overlap, and performance claims rest on empirical evaluation rather than tautological derivation. This matches the default case of a non-circular method paper.
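
The separation between teacher and student described here can be illustrated with a scalar toy version of a consistency distillation loss. This is a sketch under stated assumptions, not the paper's training code: it omits guidance, any step skipping, and the update rule for the target network, and it uses an assumed cosine schedule and illustrative names. The teacher supplies a noise prediction, a deterministic DDIM step moves the noisy latent to a lower noise level, and the loss compares the student's prediction at the higher level with a frozen target's prediction at the lower level.

```python
import math


def alpha_sigma(t):
    """Assumed variance-preserving schedule on t in [0, 1)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim_step(z_t, eps_hat, t, s):
    """One deterministic DDIM step from noise level t down to s < t."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    z0_hat = (z_t - s_t * eps_hat) / a_t  # teacher's implied clean latent
    return a_s * z0_hat + s_s * eps_hat


def distillation_loss(student, target, teacher_eps, z0, noise, t, s):
    """Squared error between student(z_t, t) and target(z_s, s) on one trajectory."""
    a_t, s_t = alpha_sigma(t)
    z_t = a_t * z0 + s_t * noise                      # noisy latent from data
    z_s = ddim_step(z_t, teacher_eps(z_t, t), t, s)   # teacher-guided solver step
    return (student(z_t, t) - target(z_s, s)) ** 2


# Toy check: all data equals 2.0, so the exact teacher and consistency function are known.
DATA = 2.0


def exact_teacher(z, t):
    a, s = alpha_sigma(t)
    return (z - a * DATA) / s


def exact_fn(z, t):
    return DATA


def biased_fn(z, t):
    return DATA + 0.5


loss_exact = distillation_loss(exact_fn, exact_fn, exact_teacher, DATA, 0.3, 0.8, 0.6)
loss_biased = distillation_loss(biased_fn, exact_fn, exact_teacher, DATA, 0.3, 0.8, 0.6)
```

The target of the loss comes from the teacher's solver step and a frozen copy of the network, which is the sense in which the learned mapping is not defined in terms of itself.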

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that guided reverse diffusion equals an augmented PF-ODE whose solution can be learned directly in latent space; training details and architectural choices are treated as standard but not enumerated in the abstract.

free parameters (2)

number of inference steps = 2-4
The 2-4 step regime is chosen as the operating point for the consistency prediction.
distillation training budget = 32 A100 GPU hours
The reported 32 A100 GPU hours is a concrete training cost for the 768x768 model.

axioms (1)

[domain assumption] The guided reverse diffusion process can be viewed as solving an augmented probability flow ODE (PF-ODE)
This equivalence is invoked to justify training the model to predict the ODE solution directly rather than iterating.
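
For reference, the assumption can be written out in one common notation. The symbols below are not taken from this page and may differ from the paper's: $z_t$ is the noisy latent, $c$ the text condition, $\omega$ the guidance scale, $f(t)$ and $g(t)$ the drift and diffusion coefficients of the forward process, $\sigma_t$ the noise level, and $\boldsymbol{f}_\theta$ the consistency function.

```latex
% Classifier-free guided noise prediction
\tilde{\epsilon}_\theta(z_t, \omega, c, t)
  = (1+\omega)\,\epsilon_\theta(z_t, c, t) - \omega\,\epsilon_\theta(z_t, \varnothing, t)

% Augmented PF-ODE: the usual probability flow ODE with the guided prediction substituted
\frac{\mathrm{d} z_t}{\mathrm{d} t}
  = f(t)\, z_t + \frac{g^2(t)}{2\sigma_t}\, \tilde{\epsilon}_\theta(z_t, \omega, c, t)

% Self-consistency: every point on one ODE trajectory maps to the same endpoint
\boldsymbol{f}_\theta(z_t, \omega, c, t) = \boldsymbol{f}_\theta(z_{t'}, \omega, c, t')
  \quad \text{for all } t, t' \in [\epsilon, T]
```

The second line follows from the standard probability flow ODE once the score is approximated by $-\epsilon_\theta / \sigma_t$; the third is the property the distilled model is trained to satisfy.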

invented entities (2)

Latent Consistency Model (LCM) · no independent evidence
purpose: A distilled model that predicts the final latent code in few steps
New model class introduced for fast sampling from LDMs.
Latent Consistency Fine-tuning (LCF) · no independent evidence
purpose: Specialized fine-tuning procedure for adapting LCMs to custom datasets
New adaptation method proposed alongside the base LCM.

pith-pipeline@v0.9.0 · 5514 in / 1591 out tokens · 123214 ms · 2026-05-13T04:06:57.073035+00:00 · methodology

Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
cs.CV 2026-05 unverdicted novelty 8.0

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.
Normalizing Trajectory Models
cs.CV 2026-05 unverdicted novelty 7.0

NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
Normalizing Trajectory Models
cs.CV 2026-05 unverdicted novelty 7.0

NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 7.0

AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance
cs.CV 2026-04 unverdicted novelty 7.0

CoEdit is a zero-shot coopetitive framework for text-guided image editing that uses dual-entropy attention manipulation and entropic latent refinement to improve editing harmony and structural preservation.
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
cs.CV 2026-04 conditional novelty 7.0

1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
cs.CV 2026-04 conditional novelty 7.0

SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
One Step Diffusion via Shortcut Models
cs.LG 2024-10 conditional novelty 7.0

Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
Fast Image Super-Resolution via Consistency Rectified Flow
cs.CV 2026-05 unverdicted novelty 6.0

FlowSR enables single-step image super-resolution by learning a rectified flow from LR to HR with consistency distillation, HR regularization, and dual fast-slow timestep scheduling.
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
cs.CV 2026-05 unverdicted novelty 6.0

FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
Gradient-Free Noise Optimization for Reward Alignment in Generative Models
cs.LG 2026-05 unverdicted novelty 6.0

ZeNO frames noise optimization as a path-integral control problem solvable from zeroth-order reward evaluations, connecting to implicit Langevin dynamics for reward-tilted distributions.
Gradient-Free Noise Optimization for Reward Alignment in Generative Models
cs.LG 2026-05 unverdicted novelty 6.0

ZeNO formulates noise optimization for reward alignment as a path-integral control problem solvable via zeroth-order reward evaluations alone, connecting to Langevin dynamics under an Ornstein-Uhlenbeck process.
FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
cs.CV 2026-05 unverdicted novelty 6.0

FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.
FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
cs.CV 2026-05 unverdicted novelty 6.0

FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 6.0

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 6.0

AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
cs.CV 2026-04 unverdicted novelty 6.0

MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.
AlloSR$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows
cs.CV 2026-04 unverdicted novelty 6.0

AlloSR$^2$ rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
cs.CV 2026-04 unverdicted novelty 6.0

By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
IncreFA: Breaking the Static Wall of Generative Model Attribution
cs.CV 2026-04 unverdicted novelty 6.0

IncreFA uses hierarchical constraints with learnable orthogonal priors and a latent memory bank to enable continual adaptation for attributing images to new generative models, reporting SOTA accuracy and 98.93% unseen...
Towards Design Compositing
cs.CV 2026-04 unverdicted novelty 6.0

GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
cs.CV 2026-04 unverdicted novelty 6.0

RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
Self-Adversarial One Step Generation via Condition Shifting
cs.CV 2026-04 unverdicted novelty 6.0

APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models
cs.CY 2026-04 conditional novelty 6.0

BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.
VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis
cs.CV 2026-04 unverdicted novelty 6.0

VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
cs.CV 2026-04 unverdicted novelty 6.0

INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop
cs.CV 2026-04 unverdicted novelty 6.0

ExpressEdit delivers fast, artifact-free stylized facial expression editing inside Photoshop via a diffusion model plugin and an accompanying expression database.
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV 2026-03 unverdicted novelty 6.0

Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
eess.AS 2024-06 unverdicted novelty 6.0

Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.
Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching
cs.LG 2026-05 unverdicted novelty 5.0

SharpEuler estimates a sharpness profile via finite differences on calibration trajectories, smooths it, and applies a quantile transform to generate adaptive timestep grids that improve Euler sampling quality in flow...
Identity-Decoupled Anonymization for Visual Evidence in Multi-modal Retrieval-Augmented Generation
cs.CV 2026-04 unverdicted novelty 5.0

Proposes a three-part generative anonymization pipeline using disentangled variational encoding, manifold-aware identity replacement, and distilled latent diffusion to protect face identities in MRAG while preserving ...
PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
cs.CV 2026-04 unverdicted novelty 5.0

PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.
Reward-Aware Trajectory Shaping for Few-step Visual Generation
cs.CV 2026-04 unverdicted novelty 5.0

RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 33 Pith papers · 12 internal anchors

[1]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Desire: Distant future prediction in dynamic scenes with interacting agents , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[2]

arXiv preprint arXiv:1907.04967 , year=

Diverse trajectory forecasting with determinantal point processes , author=. arXiv preprint arXiv:1907.04967 , year=

work page arXiv 1907
[3]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Social gan: Socially acceptable trajectories with generative adversarial networks , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[4]

2019 International Conference on Robotics and Automation (ICRA) , pages=

Multimodal trajectory predictions for autonomous driving using deep convolutional networks , author=. 2019 International Conference on Robotics and Automation (ICRA) , pages=. 2019 , organization=

work page 2019
[5]

European Conference on Computer Vision , pages=

Learning lane graph representations for motion forecasting , author=. European Conference on Computer Vision , pages=. 2020 , organization=

work page 2020
[6]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Densetnt: End-to-end trajectory prediction from dense goal sets , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[7]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[8]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Diverse generation for multi-agent sports games , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[9]

arXiv preprint arXiv:1902.09641 , year=

Stochastic prediction of multi-agent interactions from partial observations , author=. arXiv preprint arXiv:1902.09641 , year=

work page arXiv 1902
[10]

Advances in neural information processing systems , volume=

Improved training of wasserstein gans , author=. Advances in neural information processing systems , volume=

work page
[11]

Advances in Neural Information Processing Systems , volume=

Multiple futures prediction , author=. Advances in Neural Information Processing Systems , volume=

work page
[12]

Proceedings of the European Conference on Computer Vision (ECCV) , pages=

R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting , author=. Proceedings of the European Conference on Computer Vision (ECCV) , pages=

work page
[13]

Multipath: Multiple proba- bilistic anchor trajectory hypotheses for behavior prediction.arXiv preprint arXiv:1910.05449,

Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction , author=. arXiv preprint arXiv:1910.05449 , year=

work page arXiv 1910
[14]

arXiv preprint arXiv:2111.14973 , year=

Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction , author=. arXiv preprint arXiv:2111.14973 , year=

work page arXiv
[15]

Conference on Robot Learning , pages=

Tnt: Target-driven trajectory prediction , author=. Conference on Robot Learning , pages=. 2021 , organization=

work page 2021
[16]

Advances in Neural Information Processing Systems , volume=

Denoising diffusion probabilistic models , author=. Advances in Neural Information Processing Systems , volume=

work page
[18]

International Conference on Machine Learning , pages=

Improved denoising diffusion probabilistic models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[19]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[21]

Advances in Neural Information Processing Systems , volume=

Generative modeling by estimating gradients of the data distribution , author=. Advances in Neural Information Processing Systems , volume=

work page
[22]

Advances in Neural Information Processing Systems , volume=

Maximum likelihood training of score-based diffusion models , author=. Advances in Neural Information Processing Systems , volume=

work page
[23]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Argoverse: 3d tracking and forecasting with rich maps , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[27]

Advances in neural information processing systems , volume=

Learning structured output representation using deep conditional generative models , author=. Advances in neural information processing systems , volume=

work page
[28]

Communications of the ACM , volume=

Generative adversarial networks , author=. Communications of the ACM , volume=. 2020 , publisher=

work page 2020
[29]

Score-based generative modeling with critically-damped langevin diffusion. arXiv preprint arXiv:2112.07068.

work page arXiv
[35]

Truncated diffusion probabilistic models. stat.

work page
[37]

Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems.

work page
[38]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Learning transferable visual models from natural language supervision. In International conference on machine learning, 2021.

work page 2021
[40]

On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

work page
[43]

Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems.

work page
[44]

Learning multiple layers of features from tiny images. 2009.

work page 2009
[45]

Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 2009.

work page 2009
[47]

Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, 2023.

work page 2023
[50]

Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research.

work page
[51]

A connection between score matching and denoising autoencoders. Neural computation, 2011.

work page 2011
[52]

Justin N. M. Pinkney. Pokemon blip captions. 2022.

work page 2022
[53]

Norod78. Simpsons blip captions. 2022.

work page 2022
[55]

Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems.

work page
[58]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248--255. IEEE, 2009

work page 2009
[59]

Generative adversarial networks

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139--144, 2020

work page 2020
[60]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[61]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840--6851, 2020

work page 2020
[62]

Estimation of non-normalized statistical models by score matching

Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005

work page 2005
[63]

Gotta go fast when generating data with score-based models

Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, 2021

work page arXiv 2021
[64]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565--26577, 2022

work page 2022
[65]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[66]

On the variance of the adaptive learning rate and beyond

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019

work page arXiv 2019
[67]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[68]

Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023

work page arXiv 2023
[69]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022a

work page arXiv 2022
[70]

Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b

work page arXiv 2022
[71]

Accelerating diffusion models via early stop of the diffusion process

Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, and Bo Dai. Accelerating diffusion models via early stop of the diffusion process. arXiv preprint arXiv:2205.12524, 2022

work page arXiv 2022
[72]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14297--14306, 2023

work page 2023
[73]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[74]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162--8171. PMLR, 2021

work page 2021
[75]

Simpsons blip captions

Norod78. Simpsons blip captions. https://huggingface.co/datasets/Norod78/simpsons-blip-captions, 2022

work page 2022
[76]

Justin N. M. Pinkney. Pokemon blip captions. https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions/, 2022

work page 2022
[77]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[78]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684--10695, 2022

work page 2022
[79]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479--36494, 2022

work page 2022
[80]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[81]

LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

work page internal anchor Pith review arXiv 2022
[82]

Learning structured output representation using deep conditional generative models

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015

work page 2015
[83]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a

work page internal anchor Pith review Pith/arXiv arXiv 2020
[84]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[85]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b

work page internal anchor Pith review Pith/arXiv arXiv 2020
[86]

Maximum likelihood training of score-based diffusion models

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415--1428, 2021

work page 2021
[87]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023

work page internal anchor Pith review arXiv 2023
[88]

Learning to efficiently sample from diffusion probabilistic models

Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802, 2021

work page arXiv 2021
[89]

LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015

work page internal anchor Pith review arXiv 2015
[90]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

work page arXiv 2023
[91]

Fast sampling of diffusion models via operator learning

Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pp. 42390--42402. PMLR, 2023

work page 2023
[92]

Truncated diffusion probabilistic models

Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models. stat, 1050:7, 2022

work page 2022