pith. machine review for the scientific record.

arxiv: 2310.04378 · v1 · submitted 2023-10-06 · 💻 cs.CV · cs.LG

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Pith reviewed 2026-05-13 04:06 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords latent consistency models · few-step inference · text-to-image generation · diffusion models · image synthesis · probability flow ODE · distillation

The pith

Latent Consistency Models enable high-resolution image synthesis in 2 to 4 inference steps by directly predicting ODE solutions in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Latent Consistency Models that treat the guided reverse diffusion process as solving an augmented probability flow ODE and directly predict its solution in latent space. This removes the need for dozens of iterative steps while preserving high fidelity when distilled from pre-trained classifier-free guided diffusion models. A 768 by 768 LCM can be trained in 32 A100 GPU hours and supports rapid sampling. The authors also present Latent Consistency Fine-tuning to adapt the models to custom image datasets. If the approach holds, text-to-image generation becomes far less computationally demanding and more practical for everyday use.

Core claim

Latent Consistency Models are designed to directly predict the solution of the augmented probability flow ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling on any pre-trained LDM including Stable Diffusion after efficient distillation from classifier-free guided models.

What carries the argument

A Latent Consistency Model that directly predicts the solution of the augmented probability flow ODE in latent space.
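To make the few-step idea concrete, here is a minimal sketch of multistep consistency sampling on a toy problem. Everything in it is an assumption for illustration: the "latents" are one-dimensional and Gaussian, the cosine noise schedule `alpha_sigma` is hypothetical, and `consistency_fn` is the closed-form ODE solution for that toy distribution, standing in for a trained LCM network. Text conditioning and guidance are left out.

```python
import math
import random
import statistics

# Toy assumption: clean "latents" are 1-D and distributed as N(MU, S^2).
MU, S = 2.0, 0.5

def alpha_sigma(t):
    """Hypothetical variance-preserving noise schedule for t in (0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def consistency_fn(z_t, t):
    """Closed-form PF-ODE solution map z_t -> z_0 for Gaussian data.

    Stand-in for a trained LCM; a real model would be a neural network.
    """
    alpha, sigma = alpha_sigma(t)
    marginal_std = math.sqrt(alpha**2 * S**2 + sigma**2)
    return MU + S * (z_t - alpha * MU) / marginal_std

def multistep_sample(timesteps, rng):
    """Predict z_0 from pure noise, then alternate re-noising and predicting."""
    z0 = consistency_fn(rng.gauss(0.0, 1.0), timesteps[0])
    for t in timesteps[1:]:
        alpha, sigma = alpha_sigma(t)
        z_t = alpha * z0 + sigma * rng.gauss(0.0, 1.0)  # jump back to noise level t
        z0 = consistency_fn(z_t, t)                      # one model call per step
    return z0

rng = random.Random(0)
steps = [1.0, 0.75, 0.5, 0.25]  # four inference steps, from high to low noise
samples = [multistep_sample(steps, rng) for _ in range(20000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
```

Each step costs one call to the consistency function, so the length of `steps` is the inference budget. In this toy the function is exact, so even a single step recovers the data distribution; with a learned, imperfect model the extra re-noise-and-predict steps are what trade compute for quality.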

If this is right

  • High-quality 768 by 768 images can be produced with only 2 to 4 sampling steps.
  • Training a capable LCM requires just 32 A100 GPU hours.
  • State-of-the-art text-to-image results are achievable under few-step inference constraints.
  • Latent Consistency Fine-tuning adapts the models to specialized image collections with low additional cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interactive or real-time image generation tools could run on consumer hardware without server-scale resources.
  • The same distillation pattern might apply to other latent-space generative tasks such as video or 3D synthesis.
  • Widespread adoption would reduce total energy use for large-scale image generation services.

Load-bearing premise

The consistency property learned via distillation in latent space will preserve high visual fidelity and text alignment across diverse prompts without iterative refinement.

What would settle it

A controlled test on the LAION-5B-Aesthetics dataset showing that 2-step LCM outputs have substantially higher FID scores or lower human preference ratings for quality and prompt alignment than 50-step baseline LDM outputs on the same prompts would falsify the claim.
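For readers unfamiliar with the metric, the sketch below computes a Fréchet distance between two sets of feature vectors, which is the quantity FID reports (lower is better). It is an illustration only: real FID uses Inception-network features of generated and reference images, whereas the arrays here are random placeholders, and the function name `frechet_distance` is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Squared Fréchet distance between Gaussians fitted to two feature sets.

    Rows are samples, columns are feature dimensions.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of square roots of the eigenvalues
    # of cov_a @ cov_b; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Placeholder features (assumption): 500 samples with 4 dimensions each.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4))
same = frechet_distance(feats, feats)           # identical sets
shifted = frechet_distance(feats, feats + 3.0)  # same spread, shifted mean
```

Shifting every feature by 3 leaves the covariance unchanged, so only the mean term contributes: 4 dimensions times 3 squared. Comparing a 2-step LCM with a 50-step baseline would mean computing this distance for each model against the same reference set.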

Original abstract

Latent Diffusion Models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (Song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (Rombach et al.). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Latent Consistency Models (LCMs) obtained by distilling a consistency function from pre-trained classifier-free guided latent diffusion models (LDMs) such as Stable Diffusion. By framing the guided reverse process as an augmented probability-flow ODE in latent space, LCMs are designed to map any point on an ODE trajectory directly to its clean latent, which allows sampling in 2-4 steps. The authors report that a 768×768 LCM can be trained in 32 A100 GPU hours and claim state-of-the-art text-to-image performance on the LAION-5B-Aesthetics dataset; they additionally propose Latent Consistency Fine-tuning (LCF) for dataset-specific adaptation.

Significance. If the empirical results are robust, the work would be significant: it reduces the inference cost of high-resolution diffusion models by roughly an order of magnitude while preserving quality, directly addressing the primary practical limitation of LDMs. The low reported training budget and the introduction of a fine-tuning procedure further increase the potential impact for both research and deployment.

major comments (3)
  1. [Experiments] Experiments section: the central SOTA claim is supported only by qualitative examples and aggregate statements; no quantitative tables compare FID, CLIP score, or human preference against strong few-step baselines (e.g., distilled Consistency Models, progressive distillation, or SD with 4-step DPM-Solver) on the same LAION-5B-Aesthetics split. Without these numbers the performance assertion cannot be verified.
  2. [Method] Method (distillation procedure): the consistency loss is applied after folding classifier-free guidance into the PF-ODE, yet no ablation quantifies how well the learned consistency function preserves alignment at guidance scales >7.5 or on out-of-distribution prompts. This directly tests the skeptic concern that iterative error correction is being replaced by an unverified generalization assumption.
  3. [Experiments] Table 1 / Figure 4 (if present): the reported 32 A100-hour training budget is given without breakdown of batch size, number of distillation iterations, or teacher sampling cost; this makes it impossible to assess whether the efficiency claim is reproducible or comparable to prior distillation work.
minor comments (2)
  1. [Method] Notation: the augmented PF-ODE is introduced without an explicit equation number; adding Eq. (X) for the guided velocity field would clarify how the consistency target is constructed.
  2. The project page link is given but the manuscript does not state whether code or checkpoints will be released, which is standard for reproducibility in this area.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central SOTA claim is supported only by qualitative examples and aggregate statements; no quantitative tables compare FID, CLIP score, or human preference against strong few-step baselines (e.g., distilled Consistency Models, progressive distillation, or SD with 4-step DPM-Solver) on the same LAION-5B-Aesthetics split. Without these numbers the performance assertion cannot be verified.

    Authors: We agree that quantitative metrics are essential to rigorously support the state-of-the-art claim. In the revised manuscript we will add a dedicated table reporting FID and CLIP scores for LCMs against the suggested few-step baselines (distilled Consistency Models, progressive distillation, and 4-step DPM-Solver) on the identical LAION-5B-Aesthetics evaluation split. These metrics have been computed and will be included to allow direct verification of the performance assertions. revision: yes

  2. Referee: [Method] Method (distillation procedure): the consistency loss is applied after folding classifier-free guidance into the PF-ODE, yet no ablation quantifies how well the learned consistency function preserves alignment at guidance scales >7.5 or on out-of-distribution prompts. This directly tests the skeptic concern that iterative error correction is being replaced by an unverified generalization assumption.

    Authors: This is a fair point about potential limitations in generalization. We will add a new ablation subsection (and corresponding figure) that systematically evaluates text-image alignment at guidance scales from 5 to 15 and on a set of out-of-distribution prompts. The results will quantify how well the distilled consistency function maintains prompt adherence without relying on iterative refinement. revision: yes

  3. Referee: [Experiments] Table 1 / Figure 4 (if present): the reported 32 A100-hour training budget is given without breakdown of batch size, number of distillation iterations, or teacher sampling cost; this makes it impossible to assess whether the efficiency claim is reproducible or comparable to prior distillation work.

    Authors: We concur that a more granular breakdown is required for reproducibility. The revised manuscript will expand the training-details paragraph (and update Table 1) to explicitly state the batch size, total number of distillation iterations, and the per-iteration teacher sampling cost, enabling direct comparison with prior distillation methods. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained via external distillation

Full rationale

The paper's central construction extends Consistency Models (Song et al.) to latent space by distilling a consistency function from an external pre-trained LDM (Rombach et al.). The ODE prediction property is enforced through a separate distillation loss on the pre-trained model outputs, not by redefining the target as the input. No equations reduce the learned mapping to a fitted parameter or self-referential definition. Citations are to independent prior work with no author overlap, and performance claims rest on empirical evaluation rather than tautological derivation. This matches the default case of a non-circular method paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that guided reverse diffusion equals an augmented PF-ODE whose solution can be learned directly in latent space; training details and architectural choices are treated as standard but not enumerated in the abstract.
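For orientation, the domain assumption can be written out as follows. This is a sketch in standard diffusion notation, none of which appears on this page: drift $f(t)$, diffusion coefficient $g(t)$, noise level $\sigma_t$, noise predictor $\epsilon_\theta$, prompt $c$, guidance scale $\omega$, and $F_\theta$ as a label for the consistency function chosen here to avoid clashing with the drift. The exact form in the paper may differ.

```latex
% Guided noise prediction with guidance scale \omega (classifier-free guidance)
\tilde{\epsilon}_\theta(z_t, \omega, c, t)
  = (1 + \omega)\,\epsilon_\theta(z_t, c, t) - \omega\,\epsilon_\theta(z_t, \varnothing, t)

% Augmented PF-ODE: the usual probability flow ODE driven by the guided prediction
\frac{\mathrm{d}z_t}{\mathrm{d}t}
  = f(t)\,z_t + \frac{g^2(t)}{2\sigma_t}\,\tilde{\epsilon}_\theta(z_t, \omega, c, t)

% Consistency property: all points on one trajectory map to the same clean latent
F_\theta(z_t, \omega, c, t) = F_\theta(z_{t'}, \omega, c, t')
  \quad \text{for all } t, t' \text{ on the same trajectory}
```

Under this reading, the guidance scale becomes an extra input to the learned map rather than a quantity applied at every sampling step, which is why the referee asks how well the distilled function holds up across guidance scales.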

free parameters (2)
  • number of inference steps = 2-4
    The 2-4 step regime is chosen as the operating point for the consistency prediction.
  • distillation training budget = 32 A100 GPU hours
    The reported 32 A100 GPU hours is a concrete training cost for the 768x768 model.
axioms (1)
  • domain assumption: The guided reverse diffusion process can be viewed as solving an augmented probability flow ODE (PF-ODE)
    This equivalence is invoked to justify training the model to predict the ODE solution directly rather than iterating.
invented entities (2)
  • Latent Consistency Model (LCM) · no independent evidence
    purpose: A distilled model that predicts the final latent code in few steps
    New model class introduced for fast sampling from LDMs.
  • Latent Consistency Fine-tuning (LCF) · no independent evidence
    purpose: Specialized fine-tuning procedure for adapting LCMs to custom datasets
    New adaptation method proposed alongside the base LCM.

pith-pipeline@v0.9.0 · 5514 in / 1591 out tokens · 123214 ms · 2026-05-13T04:06:57.073035+00:00 · methodology

Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.

  2. STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.

  3. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  4. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

  5. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...

  6. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  7. From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

    cs.CV 2026-04 unverdicted novelty 7.0

    CoEdit is a zero-shot coopetitive framework for text-guided image editing that uses dual-entropy attention manipulation and entropic latent refinement to improve editing harmony and structural preservation.

  8. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  9. Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

    cs.CV 2026-04 conditional novelty 7.0

    SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

  10. One Step Diffusion via Shortcut Models

    cs.LG 2024-10 conditional novelty 7.0

    Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.

  11. Fast Image Super-Resolution via Consistency Rectified Flow

    cs.CV 2026-05 unverdicted novelty 6.0

    FlowSR enables single-step image super-resolution by learning a rectified flow from LR to HR with consistency distillation, HR regularization, and dual fast-slow timestep scheduling.

  12. FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

    cs.CV 2026-05 unverdicted novelty 6.0

    FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.

  13. Gradient-Free Noise Optimization for Reward Alignment in Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeNO frames noise optimization as a path-integral control problem solvable from zeroth-order reward evaluations, connecting to implicit Langevin dynamics for reward-tilted distributions.

  14. Gradient-Free Noise Optimization for Reward Alignment in Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeNO formulates noise optimization for reward alignment as a path-integral control problem solvable via zeroth-order reward evaluations alone, connecting to Langevin dynamics under an Ornstein-Uhlenbeck process.

  15. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.

  16. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...

  17. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  18. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

  19. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.

  20. PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.

  21. MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution

    cs.CV 2026-04 unverdicted novelty 6.0

    MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.

  22. Allo{SR}$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows

    cs.CV 2026-04 unverdicted novelty 6.0

    Allo{SR}^2 rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.

  23. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  24. IncreFA: Breaking the Static Wall of Generative Model Attribution

    cs.CV 2026-04 unverdicted novelty 6.0

    IncreFA uses hierarchical constraints with learnable orthogonal priors and a latent memory bank to enable continual adaptation for attributing images to new generative models, reporting SOTA accuracy and 98.93% unseen...

  25. Towards Design Compositing

    cs.CV 2026-04 unverdicted novelty 6.0

    GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.

  26. DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

    cs.CV 2026-04 unverdicted novelty 6.0

    RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...

  27. Self-Adversarial One Step Generation via Condition Shifting

    cs.CV 2026-04 unverdicted novelty 6.0

    APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.

  28. BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models

    cs.CY 2026-04 conditional novelty 6.0

    BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.

  29. VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

  30. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  31. ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop

    cs.CV 2026-04 unverdicted novelty 6.0

    ExpressEdit delivers fast, artifact-free stylized facial expression editing inside Photoshop via a diffusion model plugin and an accompanying expression database.

  32. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  33. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  34. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    eess.AS 2024-06 unverdicted novelty 6.0

    Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.

  35. Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching

    cs.LG 2026-05 unverdicted novelty 5.0

    SharpEuler estimates a sharpness profile via finite differences on calibration trajectories, smooths it, and applies a quantile transform to generate adaptive timestep grids that improve Euler sampling quality in flow...

  36. Identity-Decoupled Anonymization for Visual Evidence in Multi-modal Retrieval-Augmented Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Proposes a three-part generative anonymization pipeline using disentangled variational encoding, manifold-aware identity replacement, and distilled latent diffusion to protect face identities in MRAG while preserving ...

  37. PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

    cs.CV 2026-04 unverdicted novelty 5.0

    PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.

  38. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 33 Pith papers · 12 internal anchors

  1. [1]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Desire: Distant future prediction in dynamic scenes with interacting agents , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  2. [2]

    arXiv preprint arXiv:1907.04967 , year=

    Diverse trajectory forecasting with determinantal point processes , author=. arXiv preprint arXiv:1907.04967 , year=

  3. [3]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Social gan: Socially acceptable trajectories with generative adversarial networks , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  4. [4]

    2019 International Conference on Robotics and Automation (ICRA) , pages=

    Multimodal trajectory predictions for autonomous driving using deep convolutional networks , author=. 2019 International Conference on Robotics and Automation (ICRA) , pages=. 2019 , organization=

  5. [5]

    European Conference on Computer Vision , pages=

    Learning lane graph representations for motion forecasting , author=. European Conference on Computer Vision , pages=. 2020 , organization=

  6. [6]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Densetnt: End-to-end trajectory prediction from dense goal sets , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  7. [7]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  8. [8]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Diverse generation for multi-agent sports games , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  9. [9]

    arXiv preprint arXiv:1902.09641 , year=

    Stochastic prediction of multi-agent interactions from partial observations , author=. arXiv preprint arXiv:1902.09641 , year=

  10. [10]

    Advances in neural information processing systems , volume=

    Improved training of wasserstein gans , author=. Advances in neural information processing systems , volume=

  11. [11]

    Advances in Neural Information Processing Systems , volume=

    Multiple futures prediction , author=. Advances in Neural Information Processing Systems , volume=

  12. [12]

    Proceedings of the European Conference on Computer Vision (ECCV) , pages=

    R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting , author=. Proceedings of the European Conference on Computer Vision (ECCV) , pages=

  13. [13]

    Multipath: Multiple proba- bilistic anchor trajectory hypotheses for behavior prediction.arXiv preprint arXiv:1910.05449,

    Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction , author=. arXiv preprint arXiv:1910.05449 , year=

  14. [14]

    arXiv preprint arXiv:2111.14973 , year=

    Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction , author=. arXiv preprint arXiv:2111.14973 , year=

  15. [15]

    Conference on Robot Learning , pages=

    Tnt: Target-driven trajectory prediction , author=. Conference on Robot Learning , pages=. 2021 , organization=

  16. [16]

    Advances in Neural Information Processing Systems , volume=

    Denoising diffusion probabilistic models , author=. Advances in Neural Information Processing Systems , volume=

  17. [18]

    International Conference on Machine Learning , pages=

    Improved denoising diffusion probabilistic models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  18. [19]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  19. [21]

    Advances in Neural Information Processing Systems , volume=

    Generative modeling by estimating gradients of the data distribution , author=. Advances in Neural Information Processing Systems , volume=

  20. [22]

    Advances in Neural Information Processing Systems , volume=

    Maximum likelihood training of score-based diffusion models , author=. Advances in Neural Information Processing Systems , volume=

  21. [23]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Argoverse: 3d tracking and forecasting with rich maps , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  22. [27]

    Advances in neural information processing systems , volume=

    Learning structured output representation using deep conditional generative models , author=. Advances in neural information processing systems , volume=

  23. [28]

    Communications of the ACM , volume=

    Generative adversarial networks , author=. Communications of the ACM , volume=. 2020 , publisher=

  24. [29]

    arXiv preprint arXiv:2112.07068 , year=

    Score-based generative modeling with critically-damped langevin diffusion , author=. arXiv preprint arXiv:2112.07068 , year=

  25. [35]

    stat , volume=

    Truncated diffusion probabilistic models , author=. stat , volume=

  26. [37]

    Advances in neural information processing systems , volume=

    Gans trained by a two time-scale update rule converge to a local nash equilibrium , author=. Advances in neural information processing systems , volume=

  27. [38]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Clipscore: A reference-free evaluation metric for image captioning , author=. arXiv preprint arXiv:2104.08718 , year=

  28. [39]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  29. [40]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    On distillation of guided diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  30. [43]

    Advances in Neural Information Processing Systems , volume=

    Elucidating the design space of diffusion-based generative models , author=. Advances in Neural Information Processing Systems , volume=

  31. [44]

    Learning multiple layers of features from tiny images

    Learning multiple layers of features from tiny images. 2009

  32. [45]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248--255. IEEE, 2009

  33. [47]

    Fast sampling of diffusion models via operator learning

    Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pp. 42390--42402. PMLR, 2023

  34. [50]

    Estimation of non-normalized statistical models by score matching

    Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005

  35. [51]

    A connection between score matching and denoising autoencoders

    A connection between score matching and denoising autoencoders. Neural computation, 2011

  36. [52]

    Justin N. M. Pinkney. Pokemon blip captions. https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions/, 2022

  37. [53]

    Simpsons blip captions

    Norod78. Simpsons blip captions. https://huggingface.co/datasets/Norod78/simpsons-blip-captions, 2022

  38. [55]

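    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479--36494, 2022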

  39. [58]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248--255. IEEE, 2009

  40. [59]

    Generative adversarial networks

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11): 139--144, 2020

  41. [60]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  42. [61]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840--6851, 2020

  43. [62]

    Estimation of non-normalized statistical models by score matching

    Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005

  44. [63]

    Gotta go fast when generating data with score-based models

    Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, 2021

  45. [64]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35: 26565--26577, 2022

  46. [65]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  47. [66]

    On the variance of the adaptive learning rate and beyond

    Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019

  48. [67]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  49. [68]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023

  50. [69]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022a

  51. [70]

    Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b

  52. [71]

    Accelerating diffusion models via early stop of the diffusion process

    Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, and Bo Dai. Accelerating diffusion models via early stop of the diffusion process. arXiv preprint arXiv:2205.12524, 2022

  53. [72]

    On distillation of guided diffusion models

    Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14297--14306, 2023

  54. [73]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  55. [74]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162--8171. PMLR, 2021

  56. [75]

    Simpsons blip captions

    Norod78. Simpsons blip captions. https://huggingface.co/datasets/Norod78/simpsons-blip-captions, 2022

  57. [76]

    Justin N. M. Pinkney. Pokemon blip captions. https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions/, 2022

  58. [77]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  59. [78]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684--10695, 2022

  60. [79]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479--36494, 2022

  61. [80]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

  62. [81]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

  63. [82]

    Learning structured output representation using deep conditional generative models

    Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015

  64. [83]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a

  65. [84]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019

  66. [85]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b

  67. [86]

    Maximum likelihood training of score-based diffusion models

    Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34: 1415--1428, 2021

  68. [87]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023

  69. [88]

    Learning to efficiently sample from diffusion probabilistic models

    Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802, 2021

  70. [89]

    LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

    Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015

  71. [90]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

  72. [91]

    Fast sampling of diffusion models via operator learning

    Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pp. 42390--42402. PMLR, 2023

  73. [92]

    Truncated diffusion probabilistic models

    Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models. stat, 1050: 7, 2022