Improved Techniques for Training Consistency Models
Pith reviewed 2026-05-21 04:59 UTC · model grok-4.3
The pith
Consistency models reach FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64×64 in a single sampling step by training directly from data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Consistency models trained directly from data, without distillation, can surpass prior consistency training and distillation approaches. The recipe eliminates the exponential moving average (EMA) from the teacher model, adopts Pseudo-Huber losses, uses a lognormal noise schedule, and doubles the total number of discretization steps every set number of training iterations. It achieves FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64×64 in one step, and 2.24 and 2.77 in two steps.
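The step-doubling rule in the claim above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the default values, and the upper cap on the number of steps are all assumptions made for the example, since this page only states that the step count doubles every set number of iterations.

```python
def discretization_steps(iteration, initial_steps=10, max_steps=1280, double_every=50_000):
    """Number of discretization steps at a given training iteration.

    All defaults are illustrative placeholders, not values taken from this page.
    The cap `max_steps` is an assumption; the page only says steps are doubled.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    steps = initial_steps
    # Double once per completed interval, stopping as soon as the cap is reached.
    for _ in range(iteration // double_every):
        if steps >= max_steps:
            break
        steps *= 2
    return min(steps, max_steps)


for k in (0, 49_999, 50_000, 100_000, 1_000_000):
    print(k, discretization_steps(k))
```

The design intent, as far as it can be read from the claim, is a curriculum: coarse discretization early in training and finer discretization later.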
What carries the argument
Eliminating the exponential moving average (EMA) from the teacher consistency model, whose use previously introduced a flaw in the consistency training objective.
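To make the distinction concrete, here is a minimal sketch of an EMA teacher update, assuming parameters are plain lists of floats and that the teacher is refreshed from the student after each optimizer step. The function name `update_teacher` and the decay value are hypothetical; gradient handling (the teacher is not trained directly) is not shown.

```python
def update_teacher(teacher, student, decay):
    """Return new teacher parameters as an EMA of the student parameters.

    decay = 0.0 makes the teacher an exact copy of the current student,
    which is what "eliminating EMA" amounts to in this sketch.
    """
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


student = [0.5, -1.0, 2.0]
stale_teacher = [0.0, 0.0, 0.0]

# With EMA the teacher lags behind the student.
with_ema = update_teacher(stale_teacher, student, decay=0.9)
# Without EMA the teacher uses the student's current parameters.
without_ema = update_teacher(stale_teacher, student, decay=0.0)
print(with_ema, without_ema)
```

With a non-zero decay the teacher and student evaluate different networks; with zero decay they share the same parameters, which is the configuration the page describes as the fix.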
If this is right
- Consistency models can exceed the sample quality of their distilled counterparts in both one-step and two-step settings.
- Direct training from data removes the upper bound set by any pre-trained diffusion model.
- The approach narrows the gap to other state-of-the-art generative models while keeping one- or two-step sampling speed.
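One-step and two-step sampling can be illustrated with a toy model. The sketch below assumes the usual multistep recipe for consistency models (denoise from the largest noise level, add noise back to an intermediate level, denoise again). Everything concrete in it is an assumption for illustration: the constants, the intermediate noise level 0.8, and the closed-form `toy_consistency_fn`, which is exact only for a data distribution concentrated at the single point `MU`.

```python
import random

EPS = 0.002       # smallest noise level (assumed value)
SIGMA_MAX = 80.0  # largest noise level (assumed value)
MU = 3.0          # toy data distribution: a point mass at MU


def toy_consistency_fn(x, sigma):
    """Exact consistency function for point-mass data at MU.

    In this toy case the probability-flow ODE path through (x, sigma) is a
    straight line toward MU, so its point at noise level EPS is
    MU + (x - MU) * EPS / sigma.
    """
    return MU + (x - MU) * EPS / sigma


def sample(consistency_fn, intermediate_sigmas=(), rng=random):
    """One-step sampling, plus one extra step per intermediate noise level."""
    x = consistency_fn(SIGMA_MAX * rng.gauss(0.0, 1.0), SIGMA_MAX)
    for sigma in intermediate_sigmas:
        # Re-noise the current sample up to `sigma`, then denoise again.
        noisy = x + (sigma**2 - EPS**2) ** 0.5 * rng.gauss(0.0, 1.0)
        x = consistency_fn(noisy, sigma)
    return x


rng = random.Random(0)
one_step = sample(toy_consistency_fn, rng=rng)
two_step = sample(toy_consistency_fn, intermediate_sigmas=(0.8,), rng=rng)
print(one_step, two_step)
```

A trained network would replace `toy_consistency_fn`; each extra step costs one more network evaluation, which is why the one-step and two-step numbers are reported separately.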
Where Pith is reading between the lines
- Similar removal of moving averages or teacher-model artifacts could improve related score-based or flow-matching generative methods.
- The lognormal schedule and periodic discretization doubling might transfer to other consistency or diffusion training objectives on non-image data.
- If the gains hold, consistency models could become a default choice for applications needing both speed and quality without a separate distillation stage.
Load-bearing premise
The use of exponential moving average in the teacher consistency model is the primary bottleneck, and removing it along with the new loss and noise schedule will produce stable gains across datasets and architectures.
What would settle it
Training the improved consistency models on additional image datasets or network architectures and observing no improvement or new instabilities in FID scores would show the changes do not generalize as claimed.
Original abstract
Consistency models are a nascent family of generative models that can sample high quality data in one step without the need for adversarial training. Current consistency models achieve optimal sample quality by distilling from pre-trained diffusion models and employing learned metrics such as LPIPS. However, distillation limits the quality of consistency models to that of the pre-trained diffusion model, and LPIPS causes undesirable bias in evaluation. To tackle these challenges, we present improved techniques for consistency training, where consistency models learn directly from data without distillation. We delve into the theory behind consistency training and identify a previously overlooked flaw, which we address by eliminating Exponential Moving Average from the teacher consistency model. To replace learned metrics like LPIPS, we adopt Pseudo-Huber losses from robust statistics. Additionally, we introduce a lognormal noise schedule for the consistency training objective, and propose to double total discretization steps every set number of training iterations. Combined with better hyperparameter tuning, these modifications enable consistency models to achieve FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet $64\times 64$ respectively in a single sampling step. These scores mark a 3.5$\times$ and 4$\times$ improvement compared to prior consistency training approaches. Through two-step sampling, we further reduce FID scores to 2.24 and 2.77 on these two datasets, surpassing those obtained via distillation in both one-step and two-step settings, while narrowing the gap between consistency models and other state-of-the-art generative models.
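The Pseudo-Huber metric mentioned in the abstract has the form d(x, y) = sqrt(||x - y||^2 + c^2) - c. The following is a minimal sketch of that formula on plain Python lists; the function name and the values of `c` used in the examples are illustrative assumptions, not settings from the paper.

```python
import math


def pseudo_huber(x, y, c):
    """Pseudo-Huber distance sqrt(||x - y||^2 + c^2) - c between two vectors."""
    if c <= 0:
        raise ValueError("c must be positive")
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(squared_norm + c * c) - c


# Small differences (relative to c) are penalized roughly quadratically,
# large differences roughly linearly.
print(pseudo_huber([0.001], [0.0], c=1.0))
print(pseudo_huber([3.0, 4.0], [0.0, 0.0], c=0.01))
```

The constant `c` sets where the loss switches between the two regimes, which is why it behaves like a squared error near zero while remaining less sensitive to outliers.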
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces improved techniques for consistency training of generative models without distillation from diffusion models. Key changes include removing EMA from the teacher consistency model to address a previously overlooked flaw, adopting Pseudo-Huber loss in place of LPIPS, using a lognormal noise schedule, and doubling the number of discretization steps at regular intervals. Combined with hyperparameter tuning, these yield one-step FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64×64 (3.5× and 4× better than prior consistency training), with two-step sampling further improving to 2.24 and 2.77, surpassing distillation baselines in both settings.
Significance. If the empirical gains hold under the proposed modifications, the work meaningfully advances one-step generative modeling by demonstrating that consistency models can reach competitive quality directly from data. The theoretical identification of the EMA issue in the teacher model provides useful grounding. Direct comparisons to baselines and the reported FID improvements support the central claims, though the role of hyperparameter tuning requires clearer isolation for full attribution.
major comments (1)
- Abstract and §4 (Experiments): The results are qualified as arising from the proposed modifications 'combined with better hyperparameter tuning.' Without explicit ablations that apply equivalent hyperparameter search and tuning effort to the original consistency-training baseline (keeping EMA, LPIPS, etc.), it remains unclear whether the 3.5×/4× FID reductions are primarily driven by removing EMA, Pseudo-Huber loss, the lognormal schedule, and discretization doubling, or largely by the tuning itself. This attribution is load-bearing for the paper's central contribution claim.
minor comments (2)
- §3.2: The precise parameterization of the lognormal noise schedule (mean and variance) should be stated explicitly alongside the discretization doubling interval to aid reproducibility.
- Table 1 and Figure 2: Ensure all baseline FID numbers are obtained under identical evaluation protocols (e.g., same number of samples and classifier-free guidance settings) for fair comparison.
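To show why the mean and variance asked for in the first minor comment matter, here is a hedged sketch of one way a lognormal schedule could be discretized over a grid of noise levels: each interval of the grid receives the lognormal probability mass that falls inside it, renormalized over the grid. The function name, the geometric grid, and the values of `log_mean` and `log_std` are all hypothetical; the paper's actual parameterization is not given on this page.

```python
import math
import random


def lognormal_level_probs(sigmas, log_mean, log_std):
    """Probability of each interval [sigmas[i], sigmas[i+1]] under a lognormal law.

    The masses are renormalized so that they sum to 1 over the grid.
    """
    if len(sigmas) < 2 or sigmas[0] <= 0:
        raise ValueError("need at least two positive noise levels")
    if any(b <= a for a, b in zip(sigmas, sigmas[1:])):
        raise ValueError("noise levels must be strictly increasing")

    def cdf(sigma):
        z = (math.log(sigma) - log_mean) / (math.sqrt(2.0) * log_std)
        return 0.5 * (1.0 + math.erf(z))

    masses = [cdf(b) - cdf(a) for a, b in zip(sigmas, sigmas[1:])]
    total = sum(masses)
    return [m / total for m in masses]


# Hypothetical geometric grid of 11 noise levels and hypothetical parameters.
grid = [0.002 * (80.0 / 0.002) ** (i / 10) for i in range(11)]
probs = lognormal_level_probs(grid, log_mean=-1.0, log_std=2.0)

# Draw the index of the noise interval to train on for one example.
index = random.Random(0).choices(range(len(probs)), weights=probs)[0]
print(probs, index)
```

Changing `log_mean` shifts which noise levels are visited most often and `log_std` controls how concentrated the sampling is, so results are hard to reproduce unless both are reported together with the doubling interval.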
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comment point-by-point below and describe the revisions we will make to strengthen the attribution of results.
Point-by-point responses
-
Referee: Abstract and §4 (Experiments): The results are qualified as arising from the proposed modifications 'combined with better hyperparameter tuning.' Without explicit ablations that apply equivalent hyperparameter search and tuning effort to the original consistency-training baseline (keeping EMA, LPIPS, etc.), it remains unclear whether the 3.5×/4× FID reductions are primarily driven by removing EMA, Pseudo-Huber loss, the lognormal schedule, and discretization doubling, or largely by the tuning itself. This attribution is load-bearing for the paper's central contribution claim.
Authors: We agree that the current presentation leaves room for ambiguity in attributing the FID gains to the proposed techniques versus hyperparameter tuning. The baseline numbers are taken directly from the original consistency training paper using the hyperparameters reported therein. Our improvements include both algorithmic changes (EMA removal to fix the identified flaw, Pseudo-Huber loss, lognormal schedule, and doubling of discretization steps) and more extensive tuning. In the revised manuscript we will add explicit ablation experiments that re-train the original consistency training baseline (retaining EMA, LPIPS, etc.) under an equivalent hyperparameter search budget. These new results will be reported in §4 alongside the existing tables to better isolate the contribution of each change. The theoretical analysis in §3 already shows why EMA removal addresses a fundamental inconsistency in the teacher model; the additional ablations will provide the requested empirical separation.
Revision: yes
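One way such an ablation could be laid out is a full on/off grid over the four changes, with every configuration given the same tuning budget. The sketch below is only an illustration of that bookkeeping: the flag names, the `search_budget` field, and its value are hypothetical and do not describe experiments reported by the authors.

```python
from itertools import product

# Hypothetical flag names for the four changes discussed in the report.
CHANGES = ("remove_ema", "pseudo_huber_loss", "lognormal_schedule", "step_doubling")


def ablation_configs(search_budget):
    """Every on/off combination of the four changes, each with the same tuning budget."""
    return [
        {**dict(zip(CHANGES, flags)), "search_budget": search_budget}
        for flags in product((False, True), repeat=len(CHANGES))
    ]


configs = ablation_configs(search_budget=20)
baseline = configs[0]   # all changes off: the original recipe
full = configs[-1]      # all changes on: the proposed recipe
print(len(configs), baseline, full)
```

Holding `search_budget` fixed across all rows is the point of the referee's request: it separates the effect of the algorithmic changes from the effect of tuning effort.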
Circularity Check
Minor self-citation to prior consistency model work but central claims remain empirically independent
Full rationale
The paper identifies a flaw in prior consistency training (EMA in the teacher model) and introduces new components: Pseudo-Huber loss, lognormal noise schedule, doubled discretization steps, and hyperparameter tuning. These yield reported FID improvements on CIFAR-10 and ImageNet 64×64, validated against external baselines. No derivation reduces a prediction to a fitted input by construction, nor does any uniqueness theorem or ansatz smuggle in prior results as forced. Self-citation to the authors' earlier consistency models paper provides the baseline for comparison but is not load-bearing for the new empirical modifications or results. The chain is self-contained with independent content.
Axiom & Free-Parameter Ledger
free parameters (2)
- lognormal noise schedule parameters
- discretization doubling interval
axioms (1)
- domain assumption: There exists a previously overlooked flaw in consistency training stemming from the use of exponential moving average in the teacher model.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
we adopt Pseudo-Huber losses from robust statistics... d(x,y)=sqrt(||x-y||^2 + c^2)-c
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
One-Step Generative Modeling via Wasserstein Gradient Flows
W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...
-
From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance
CoEdit is a zero-shot coopetitive framework for text-guided image editing that uses dual-entropy attention manipulation and entropic latent refinement to improve editing harmony and structural preservation.
-
Training Agents Inside of Scalable World Models
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
-
One Step Diffusion via Shortcut Models
Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
-
DCFold: Efficient Protein Structure Generation with Single Forward Pass
DCFold achieves AlphaFold3-level protein structure prediction accuracy in a single forward pass using Dual Consistency training and a Temporal Geodesic Matching scheduler, delivering 15x inference acceleration.
-
Thermal-Only Crowd Counting with Deployment-Time Privacy Protection
A privacy-preserving thermal-only crowd counting framework extracts enhanced features from thermal images via single-step LCM denoising in a depth-to-RGB diffusion model and matches RGB-T fusion performance without RG...
-
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers
DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
-
Efficient Image Synthesis with Sphere Latent Encoder
Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.
-
FlashMol: High-Quality Molecule Generation in as Few as Four Steps
FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...
-
Physical Fidelity Reconstruction via Improved Consistency-Distilled Flow Matching for Dynamical Systems
Distilled one-step consistency model from optimal-transport flow-matching teacher reconstructs high-fidelity dynamical system flows from low-fidelity data with 12x speedup, half the parameters, and 23.1% better SSIM t...
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Efficient Diffusion Distillation via Embedding Loss
Embedding Loss aligns feature distributions via MMD in random network embeddings to boost one-step diffusion distillation, reaching SOTA FID of 1.475 on CIFAR-10 unconditional generation.
-
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
-
Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning
JFDL allows pre-trained Consistency Models to perform guided image generation post-hoc by aligning flow distributions, reducing FID scores on CIFAR-10 and ImageNet without needing a teacher model.
-
Dual-End Consistency Model
DE-CM reaches state-of-the-art one-step FID of 1.70 on ImageNet 256x256 by decomposing PF-ODE trajectories into three critical sub-trajectories and using flow matching plus N2N mapping for stability.
-
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
Causal Forcing initializes autoregressive diffusion students from AR teachers to recover flow maps that bidirectional teachers cannot provide, delivering 19%+ gains over Self Forcing on dynamic degree and related metrics.
-
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
The work introduces rCM, a score-regularized continuous-time consistency model that matches DMD2 quality on large models up to 14B parameters while improving diversity and enabling 1-4 step sampling.
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
-
Variance Reduction for Expectations with Diffusion Teachers
CARV introduces a hierarchical Monte Carlo estimator with amortized reuse, importance sampling, and stratification that yields 2-3x effective compute gains on diffusion-teacher pipelines while cutting gradient varianc...
-
Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations
A simplified one-step diffusion distillation uses pretrained teacher features directly for drifting loss plus a mode coverage term, achieving FID 1.58 on ImageNet-64 and 18.4 on SDXL.
-
SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
-
Discrete Meanflow Training Curriculum
A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.