Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Pith reviewed 2026-05-13 04:06 UTC · model grok-4.3
The pith
Latent Consistency Models enable high-resolution image synthesis in 2 to 4 inference steps by directly predicting ODE solutions in latent space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Latent Consistency Models are designed to directly predict the solution of the augmented probability flow ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling on any pre-trained LDM including Stable Diffusion after efficient distillation from classifier-free guided models.
What carries the argument
Latent Consistency Model that directly predicts the solution of the augmented probability flow ODE in latent space.
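To make the carrying mechanism concrete, the self-consistency property that such a model is trained to satisfy can be sketched as follows. The notation is an assumption introduced here in the style of the consistency-model framework, not taken from the summary above: $f_\theta$ is the learned function, $z_t$ the noisy latent at time $t$, $c$ the text condition, $\omega$ the guidance scale, and $[\epsilon, T]$ the time range.

```latex
% Self-consistency: every point on one PF-ODE trajectory maps to the same output.
f_\theta(z_t, \omega, c, t) = f_\theta(z_{t'}, \omega, c, t')
\quad \text{for all } t, t' \in [\epsilon, T] \text{ on the same trajectory},
% Boundary condition: at the smallest time the function is the identity.
\qquad f_\theta(z_\epsilon, \omega, c, \epsilon) = z_\epsilon .
```

Under these assumptions a single evaluation of $f_\theta$ at $t = T$ already yields a clean latent, which is why only a few steps are needed.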
If this is right
- High-quality 768 by 768 images can be produced with only 2 to 4 sampling steps.
- Training a capable LCM requires just 32 A100 GPU hours.
- State-of-the-art text-to-image results are achievable under few-step inference constraints.
- Latent Consistency Fine-tuning adapts the models to specialized image collections with low additional cost.
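The few-step sampling loop behind the first bullet can be illustrated with a toy one-dimensional example. This is a minimal sketch, not the paper's sampler: it assumes Gaussian data and the simple noising rule z_t = z_0 + t * noise, for which the PF-ODE solution map has a closed form, so `consistency_fn` stands in for a trained LCM. All names and constants (`MU`, `S`, `T_MIN`, `T_MAX`, `multistep_sample`) are hypothetical.

```python
import numpy as np

# Toy setup (all values are illustrative assumptions, not from the paper):
# data z_0 ~ N(MU, S^2), noising z_t = z_0 + t * noise, t in [T_MIN, T_MAX].
MU, S = 2.0, 0.5
T_MIN, T_MAX = 0.002, 80.0

def consistency_fn(z_t, t):
    """Closed-form PF-ODE solution map for this toy Gaussian case.

    A trained LCM would replace this with a neural network.
    """
    scale = np.sqrt(S**2 + T_MIN**2) / np.sqrt(S**2 + t**2)
    return MU + (z_t - MU) * scale

def multistep_sample(n_samples, timesteps, rng):
    """Few-step sampling: jump to a clean estimate, re-noise, jump again."""
    z = rng.normal(0.0, timesteps[0], size=n_samples)  # start from pure noise
    z0 = consistency_fn(z, timesteps[0])               # first step
    for t in timesteps[1:]:
        # Re-noise the clean estimate to the lower noise level t ...
        z = z0 + np.sqrt(t**2 - T_MIN**2) * rng.normal(size=n_samples)
        # ... then map it straight back to a clean estimate.
        z0 = consistency_fn(z, t)
    return z0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = multistep_sample(10_000, [T_MAX, 20.0, 5.0, 1.0], rng)  # 4 steps
    print(samples.mean(), samples.std())
```

Each loop iteration costs one function evaluation, so a 4-entry `timesteps` list corresponds to 4-step inference; a 1-entry list gives single-step generation.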
Where Pith is reading between the lines
- Interactive or real-time image generation tools could run on consumer hardware without server-scale resources.
- The same distillation pattern might apply to other latent-space generative tasks such as video or 3D synthesis.
- Widespread adoption could reduce total energy use for large-scale image generation services.
Load-bearing premise
The consistency property learned via distillation in latent space will preserve high visual fidelity and text alignment across diverse prompts without iterative refinement.
What would settle it
A controlled test on the LAION-5B-Aesthetics dataset showing that 2-step LCM outputs have substantially higher FID scores or lower human preference ratings for quality and prompt alignment than 50-step baseline LDM outputs on the same prompts would falsify the claim.
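The FID half of such a test reduces to a Fréchet distance between two sets of image features. The sketch below is an illustration under stated assumptions: the features are taken to be precomputed arrays (standard FID uses Inception-v3 activations, which are not computed here), and the function name `frechet_distance` is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (n_samples, n_features).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.normal(size=(1000, 8))  # stand-in for 50-step LDM features
    few_step = rng.normal(size=(1000, 8))  # stand-in for 2-step LCM features
    print(frechet_distance(baseline, few_step))
```

Lower values mean the two feature distributions are closer, so the falsifying outcome described above would show up as a clearly larger distance to real-image features for the 2-step model than for the 50-step baseline.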
Original abstract
Latent Diffusion Models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (Song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (Rombach et al.). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Latent Consistency Models (LCMs) obtained by distilling a consistency function from pre-trained classifier-free guided latent diffusion models (LDMs) such as Stable Diffusion. By framing the guided reverse process as an augmented probability-flow ODE in latent space, LCMs are designed to map any point on an ODE trajectory directly to its clean-latent solution, which allows sampling in 2-4 steps. The authors report that a 768×768 LCM can be trained in 32 A100 GPU hours and claim state-of-the-art text-to-image performance on the LAION-5B-Aesthetics dataset; they additionally propose Latent Consistency Fine-tuning (LCF) for dataset-specific adaptation.
Significance. If the empirical results are robust, the work would be significant: it reduces the inference cost of high-resolution diffusion models by roughly an order of magnitude while preserving quality, directly addressing the primary practical limitation of LDMs. The low reported training budget and the introduction of a fine-tuning procedure further increase the potential impact for both research and deployment.
major comments (3)
- [Experiments] Experiments section: the central SOTA claim is supported only by qualitative examples and aggregate statements; no quantitative tables compare FID, CLIP score, or human preference against strong few-step baselines (e.g., distilled Consistency Models, progressive distillation, or SD with 4-step DPM-Solver) on the same LAION-5B-Aesthetics split. Without these numbers the performance assertion cannot be verified.
- [Method] Method (distillation procedure): the consistency loss is applied after folding classifier-free guidance into the PF-ODE, yet no ablation quantifies how well the learned consistency function preserves alignment at guidance scales >7.5 or on out-of-distribution prompts. Such an ablation would directly test the skeptic's concern that iterative error correction is being replaced by an unverified generalization assumption.
- [Experiments] Table 1 / Figure 4 (if present): the reported 32 A100-hour training budget is given without breakdown of batch size, number of distillation iterations, or teacher sampling cost; this makes it impossible to assess whether the efficiency claim is reproducible or comparable to prior distillation work.
minor comments (2)
- [Method] Notation: the augmented PF-ODE is introduced without an explicit equation number; adding Eq. (X) for the guided velocity field would clarify how the consistency target is constructed.
- The project page link is given but the manuscript does not state whether code or checkpoints will be released, which is standard for reproducibility in this area.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.
Point-by-point responses
-
Referee: [Experiments] Experiments section: the central SOTA claim is supported only by qualitative examples and aggregate statements; no quantitative tables compare FID, CLIP score, or human preference against strong few-step baselines (e.g., distilled Consistency Models, progressive distillation, or SD with 4-step DPM-Solver) on the same LAION-5B-Aesthetics split. Without these numbers the performance assertion cannot be verified.
Authors: We agree that quantitative metrics are essential to rigorously support the state-of-the-art claim. In the revised manuscript we will add a dedicated table reporting FID and CLIP scores for LCMs against the suggested few-step baselines (distilled Consistency Models, progressive distillation, and 4-step DPM-Solver) on the identical LAION-5B-Aesthetics evaluation split. These metrics have been computed and will be included to allow direct verification of the performance assertions. revision: yes
-
Referee: [Method] Method (distillation procedure): the consistency loss is applied after folding classifier-free guidance into the PF-ODE, yet no ablation quantifies how well the learned consistency function preserves alignment at guidance scales >7.5 or on out-of-distribution prompts. Such an ablation would directly test the skeptic's concern that iterative error correction is being replaced by an unverified generalization assumption.
Authors: This is a fair point about potential limitations in generalization. We will add a new ablation subsection (and corresponding figure) that systematically evaluates text-image alignment at guidance scales from 5 to 15 and on a set of out-of-distribution prompts. The results will quantify how well the distilled consistency function maintains prompt adherence without relying on iterative refinement. revision: yes
-
Referee: [Experiments] Table 1 / Figure 4 (if present): the reported 32 A100-hour training budget is given without breakdown of batch size, number of distillation iterations, or teacher sampling cost; this makes it impossible to assess whether the efficiency claim is reproducible or comparable to prior distillation work.
Authors: We concur that a more granular breakdown is required for reproducibility. The revised manuscript will expand the training-details paragraph (and update Table 1) to explicitly state the batch size, total number of distillation iterations, and the per-iteration teacher sampling cost, enabling direct comparison with prior distillation methods. revision: yes
Circularity Check
No circularity: derivation is self-contained via external distillation
Full rationale
The paper's central construction extends Consistency Models (Song et al.) to latent space by distilling a consistency function from an external pre-trained LDM (Rombach et al.). The ODE prediction property is enforced through a separate distillation loss on the pre-trained model outputs, not by redefining the target as the input. No equations reduce the learned mapping to a fitted parameter or self-referential definition. Citations are to independent prior work with no author overlap, and performance claims rest on empirical evaluation rather than tautological derivation. This matches the default case of a non-circular method paper.
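The non-circular structure described here (an external teacher supplies the ODE step, and the student is only asked to agree with itself across that step) can be sketched in a toy setting. This is an assumption-laden illustration, not the paper's training code: the data are one-dimensional Gaussian, the teacher is the exact score of the noised distribution, the solver is a single Euler step, and the target function is passed in directly, whereas practical consistency distillation typically uses a slowly updated copy of the student as the target. All names are hypothetical.

```python
import numpy as np

MU, S, T_MIN = 2.0, 0.5, 0.002  # toy data N(MU, S^2); illustrative values

def teacher_score(z, t):
    """Exact score of N(MU, S^2 + t^2); stands in for the pre-trained teacher."""
    return -(z - MU) / (S**2 + t**2)

def ode_step(z, t_from, t_to):
    """One Euler step of the PF-ODE dz/dt = -t * score(z, t), driven by the teacher."""
    return z + (t_to - t_from) * (-t_from * teacher_score(z, t_from))

def exact_consistency_fn(z, t):
    """Closed-form solution map of the toy PF-ODE (what the student should learn)."""
    scale = np.sqrt(S**2 + T_MIN**2) / np.sqrt(S**2 + t**2)
    return MU + (z - MU) * scale

def distillation_loss(student_fn, target_fn, z0, noise, t_hi, t_lo):
    """Squared disagreement between outputs at two adjacent trajectory points."""
    z_hi = z0 + t_hi * noise            # noisy sample at the higher time
    z_lo = ode_step(z_hi, t_hi, t_lo)   # teacher moves it to the lower time
    return float(np.mean((student_fn(z_hi, t_hi) - target_fn(z_lo, t_lo)) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z0 = rng.normal(MU, S, size=4096)
    noise = rng.normal(size=4096)
    good = distillation_loss(exact_consistency_fn, exact_consistency_fn, z0, noise, 1.0, 0.9)
    bad = distillation_loss(lambda z, t: z, exact_consistency_fn, z0, noise, 1.0, 0.9)
    print(good, bad)
```

The target on the right-hand side depends on the teacher's ODE step rather than on the student's own input, which is the sense in which the learned mapping is not defined in terms of itself.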
Axiom & Free-Parameter Ledger
free parameters (2)
- number of inference steps = 2-4
- distillation training budget = 32 A100 GPU hours
axioms (1)
- domain assumption: The guided reverse diffusion process can be viewed as solving an augmented probability flow ODE (PF-ODE)
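The "guided" part of this assumption refers to classifier-free guidance, where conditional and unconditional noise predictions are mixed before each solver step. A minimal sketch of that mixing rule follows, in the (1 + w) parameterization of Ho and Salimans; the function name `guided_noise` and the array inputs are hypothetical stand-ins for a denoiser's outputs. Many implementations instead expose a scale s applied as uncond + s * (cond - uncond), which is the same rule with s = 1 + w.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guided noise estimate with guidance weight w >= 0.

    w = 0 recovers the purely conditional prediction; larger w pushes the
    estimate further away from the unconditional prediction.
    """
    return (1.0 + w) * eps_cond - w * eps_uncond

if __name__ == "__main__":
    eps_c = np.array([0.2, -0.1])  # stand-in for the text-conditioned prediction
    eps_u = np.array([0.0, 0.3])   # stand-in for the unconditional prediction
    print(guided_noise(eps_c, eps_u, w=6.5))
```

Treating w as an extra input of the ODE, rather than evaluating two predictions at sampling time, is one way to read "augmented" in the assumption above.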
invented entities (2)
- Latent Consistency Model (LCM): no independent evidence
- Latent Consistency Fine-tuning (LCF): no independent evidence
Forward citations
Cited by 38 Pith papers
-
Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
-
STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models
STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.
-
Normalizing Trajectory Models
NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
-
Normalizing Trajectory Models
NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance
CoEdit is a zero-shot coopetitive framework for text-guided image editing that uses dual-entropy attention manipulation and entropic latent refinement to improve editing harmony and structural preservation.
-
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
One Step Diffusion via Shortcut Models
Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
-
Fast Image Super-Resolution via Consistency Rectified Flow
FlowSR enables single-step image super-resolution by learning a rectified flow from LR to HR with consistency distillation, HR regularization, and dual fast-slow timestep scheduling.
-
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
-
Gradient-Free Noise Optimization for Reward Alignment in Generative Models
ZeNO frames noise optimization as a path-integral control problem solvable from zeroth-order reward evaluations, connecting to implicit Langevin dynamics for reward-tilted distributions.
-
Gradient-Free Noise Optimization for Reward Alignment in Generative Models
ZeNO formulates noise optimization for reward alignment as a path-integral control problem solvable via zeroth-order reward evaluations alone, connecting to Langevin dynamics under an Ornstein-Uhlenbeck process.
-
FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.
-
FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
-
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
-
MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.
-
AlloSR$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows
AlloSR$^2$ rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.
-
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
-
IncreFA: Breaking the Static Wall of Generative Model Attribution
IncreFA uses hierarchical constraints with learnable orthogonal priors and a latent memory bank to enable continual adaptation for attributing images to new generative models, reporting SOTA accuracy and 98.93% unseen...
-
Towards Design Compositing
GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
-
Self-Adversarial One Step Generation via Condition Shifting
APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
-
BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models
BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.
-
VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop
ExpressEdit delivers fast, artifact-free stylized facial expression editing inside Photoshop via a diffusion model plugin and an accompanying expression database.
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.
-
Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching
SharpEuler estimates a sharpness profile via finite differences on calibration trajectories, smooths it, and applies a quantile transform to generate adaptive timestep grids that improve Euler sampling quality in flow...
-
Identity-Decoupled Anonymization for Visual Evidence in Multi-modal Retrieval-Augmented Generation
Proposes a three-part generative anonymization pipeline using disentangled variational encoding, manifold-aware identity replacement, and distilled latent diffusion to protect face identities in MRAG while preserving ...
-
PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.
-
Reward-Aware Trajectory Shaping for Few-step Visual Generation
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
Reference graph
Works this paper leans on
-
[1]
Desire: Distant future prediction in dynamic scenes with interacting agents
Proceedings of the IEEE conference on computer vision and pattern recognition
-
[2]
Diverse trajectory forecasting with determinantal point processes
arXiv preprint arXiv:1907.04967
-
[3]
Social gan: Socially acceptable trajectories with generative adversarial networks
Proceedings of the IEEE conference on computer vision and pattern recognition
-
[4]
Multimodal trajectory predictions for autonomous driving using deep convolutional networks
2019 International Conference on Robotics and Automation (ICRA), 2019
-
[5]
Learning lane graph representations for motion forecasting
European Conference on Computer Vision, 2020
-
[6]
Densetnt: End-to-end trajectory prediction from dense goal sets
Proceedings of the IEEE/CVF International Conference on Computer Vision
-
[7]
Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
-
[8]
Diverse generation for multi-agent sports games
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
-
[9]
Stochastic prediction of multi-agent interactions from partial observations
arXiv preprint arXiv:1902.09641
-
[10]
Improved training of wasserstein gans
Advances in neural information processing systems
-
[11]
Multiple futures prediction
Advances in Neural Information Processing Systems
-
[12]
R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting
Proceedings of the European Conference on Computer Vision (ECCV)
-
[13]
Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction
arXiv preprint arXiv:1910.05449
-
[14]
Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction
arXiv preprint arXiv:2111.14973
-
[15]
Tnt: Target-driven trajectory prediction
Conference on Robot Learning, 2021
-
[16]
Denoising diffusion probabilistic models
Advances in Neural Information Processing Systems
-
[18]
Improved denoising diffusion probabilistic models
International Conference on Machine Learning, 2021
-
[19]
High-resolution image synthesis with latent diffusion models
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
-
[21]
Generative modeling by estimating gradients of the data distribution
Advances in Neural Information Processing Systems
-
[22]
Maximum likelihood training of score-based diffusion models
Advances in Neural Information Processing Systems
-
[23]
Argoverse: 3d tracking and forecasting with rich maps
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
-
[27]
Learning structured output representation using deep conditional generative models
Advances in neural information processing systems
-
[28]
Generative adversarial networks
Communications of the ACM, 2020
-
[29]
Score-based generative modeling with critically-damped langevin diffusion
arXiv preprint arXiv:2112.07068
- [35]
-
[37]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Advances in neural information processing systems
-
[38]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
arXiv preprint arXiv:2104.08718
-
[39]
Learning transferable visual models from natural language supervision
International conference on machine learning, 2021
-
[40]
On distillation of guided diffusion models
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
-
[43]
Elucidating the design space of diffusion-based generative models
Advances in Neural Information Processing Systems
-
[44]
Learning multiple layers of features from tiny images
2009
-
[45]
Imagenet: A large-scale hierarchical image database
2009 IEEE conference on computer vision and pattern recognition, 2009
-
[47]
Fast sampling of diffusion models via operator learning
International Conference on Machine Learning, 2023
- [50]
-
[51]
A connection between score matching and denoising autoencoders
Neural computation, 2011
-
[52]
Justin N. M. Pinkney, 2022
- [53]
-
[55]
Photorealistic text-to-image diffusion models with deep language understanding
Advances in Neural Information Processing Systems
-
[58]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255. Ieee, 2009
work page 2009
-
[59]
Generative adversarial networks
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63 0 (11): 0 139--144, 2020
work page 2020
-
[60]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[61]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 0 6840--6851, 2020
work page 2020
-
[62]
Estimation of non-normalized statistical models by score matching
Aapo Hyv \"a rinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6 0 (4), 2005
work page 2005
-
[63]
Alexia Jolicoeur-Martineau, Ke Li, R \'e mi Pich \'e -Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, 2021
-
[64]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35: 0 26565--26577, 2022
work page 2022
-
[65]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[66]
On the variance of the adaptive learning rate and beyond
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019
-
[67]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[68]
Instaflow: One step is enough for high-quality diffusion-based text-to-image generation
Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380, 2023
-
[69]
Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022 a
-
[70]
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022 b
-
[71]
Accelerating diffusion models via early stop of the diffusion process
Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, and Bo Dai. Accelerating diffusion models via early stop of the diffusion process. arXiv preprint arXiv:2205.12524, 2022
-
[72]
On distillation of guided diffusion models
Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 14297--14306, 2023
work page 2023
-
[73]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021
-
[74]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021
-
[75]
Norod78. Simpsons blip captions. https://huggingface.co/datasets/Norod78/simpsons-blip-captions, 2022
-
[76]
Justin N. M. Pinkney. Pokemon blip captions. https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions/, 2022
-
[77]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022
-
[78]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022
-
[79]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022
-
[80]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022
-
[81]
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022
-
[82]
Learning structured output representation using deep conditional generative models
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. Advances in Neural Information Processing Systems, 28, 2015
-
[83]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a
-
[84]
Generative modeling by estimating gradients of the data distribution
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019
-
[85]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b
-
[86]
Maximum likelihood training of score-based diffusion models
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021
-
[87]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023
- [88]
-
[89]
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015
-
[90]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023
-
[91]
Fast sampling of diffusion models via operator learning
Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pp. 42390–42402. PMLR, 2023
-
[92]
Truncated diffusion probabilistic models
Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models. stat, 1050:7, 2022