
arxiv: 2310.14189 · v1 · pith:DZ7PMTNAnew · submitted 2023-10-22 · 💻 cs.LG

Improved Techniques for Training Consistency Models

Pith reviewed 2026-05-21 04:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords consistency models · generative models · image synthesis · FID evaluation · noise schedule · training techniques · one-step sampling

The pith

Consistency models reach FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64x64 in a single sampling step by training directly from data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that consistency models can learn high-quality generation without distilling from pre-trained diffusion models. It identifies the use of exponential moving average (EMA) in the teacher consistency model as a flaw that limits performance, and removes it. It also replaces the learned LPIPS metric with Pseudo-Huber losses, introduces a lognormal noise schedule, and doubles the number of discretization steps at regular intervals during training. These changes, plus better hyperparameter tuning, produce the reported FID gains, and two-step sampling lowers FID further, beyond what distillation achieves. A sympathetic reader would care because this removes the quality ceiling imposed by a pre-trained diffusion model and reduces reliance on learned metrics that can bias evaluation.

Core claim

Consistency models trained directly from data without distillation can surpass prior consistency training and distillation approaches by eliminating exponential moving average from the teacher model, adopting Pseudo-Huber losses, using a lognormal noise schedule, and doubling total discretization steps every set number of training iterations, achieving FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64×64 in one step and 2.24 and 2.77 in two steps.
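
As a rough illustration of the Pseudo-Huber loss named above, the sketch below uses the standard form from robust statistics, sqrt(||x - y||^2 + c^2) - c. The function name `pseudo_huber`, the toy vectors, and both values of `c` are assumptions chosen for illustration; this page does not state the value of `c` the paper uses.

```python
import math

def pseudo_huber(x, y, c):
    """Pseudo-Huber distance between two vectors: sqrt(||x - y||^2 + c^2) - c."""
    sq_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(sq_norm + c * c) - c

# Hypothetical 4-dimensional inputs with ||x - y|| = 5.
x = [0.0, 0.0, 0.0, 0.0]
y = [3.0, 4.0, 0.0, 0.0]

# Small c: the loss behaves like the (unsquared) L2 distance, here about 5.
small_c = pseudo_huber(x, y, c=1e-3)
# Large c: the loss behaves like the squared L2 distance scaled by 1 / (2c).
large_c = pseudo_huber(x, y, c=1e3)
```

The constant `c` interpolates between the two regimes, which is why it has to be tuned rather than fixed by the formula itself.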

What carries the argument

Elimination of exponential moving average from the teacher consistency model, which previously introduced a flaw in the consistency training objective.
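
A minimal sketch of what removing EMA means for the teacher parameters, assuming the usual EMA update rule (new teacher = decay × old teacher + (1 − decay) × student). The function `update_teacher` and the toy parameter lists are hypothetical; setting the decay to zero makes the teacher an exact copy of the current student. In a real training loop the teacher output would additionally be excluded from gradient computation.

```python
def update_teacher(teacher, student, ema_decay):
    """Return new teacher parameters: ema_decay * teacher + (1 - ema_decay) * student."""
    return [ema_decay * t + (1.0 - ema_decay) * s for t, s in zip(teacher, student)]

# Hypothetical two-parameter "networks".
teacher = [0.0, 0.0]
student = [1.0, -2.0]

# With EMA, the teacher trails behind the student.
lagging = update_teacher(teacher, student, ema_decay=0.9)
# Without EMA (decay 0), the teacher equals the current student.
no_ema = update_teacher(teacher, student, ema_decay=0.0)
```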

If this is right

  • Consistency models can exceed the sample quality of their distilled counterparts in both one-step and two-step settings.
  • Direct training from data removes the upper bound set by any pre-trained diffusion model.
  • The approach narrows the gap to other state-of-the-art generative models while keeping one- or two-step sampling speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar removal of moving averages or teacher-model artifacts could improve related score-based or flow-matching generative methods.
  • The lognormal schedule and periodic discretization doubling might transfer to other consistency or diffusion training objectives on non-image data.
  • If the gains hold, consistency models could become a default choice for applications needing both speed and quality without a separate distillation stage.

Load-bearing premise

The use of exponential moving average in the teacher consistency model is the primary bottleneck, and removing it along with the new loss and noise schedule will produce stable gains across datasets and architectures.

What would settle it

Training the improved consistency models on additional image datasets or network architectures and observing no improvement or new instabilities in FID scores would show the changes do not generalize as claimed.

Original abstract

Consistency models are a nascent family of generative models that can sample high quality data in one step without the need for adversarial training. Current consistency models achieve optimal sample quality by distilling from pre-trained diffusion models and employing learned metrics such as LPIPS. However, distillation limits the quality of consistency models to that of the pre-trained diffusion model, and LPIPS causes undesirable bias in evaluation. To tackle these challenges, we present improved techniques for consistency training, where consistency models learn directly from data without distillation. We delve into the theory behind consistency training and identify a previously overlooked flaw, which we address by eliminating Exponential Moving Average from the teacher consistency model. To replace learned metrics like LPIPS, we adopt Pseudo-Huber losses from robust statistics. Additionally, we introduce a lognormal noise schedule for the consistency training objective, and propose to double total discretization steps every set number of training iterations. Combined with better hyperparameter tuning, these modifications enable consistency models to achieve FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet $64\times 64$ respectively in a single sampling step. These scores mark a 3.5$\times$ and 4$\times$ improvement compared to prior consistency training approaches. Through two-step sampling, we further reduce FID scores to 2.24 and 2.77 on these two datasets, surpassing those obtained via distillation in both one-step and two-step settings, while narrowing the gap between consistency models and other state-of-the-art generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces improved techniques for consistency training of generative models without distillation from diffusion models. Key changes include removing EMA from the teacher consistency model to address a previously overlooked flaw, adopting Pseudo-Huber loss in place of LPIPS, using a lognormal noise schedule, and doubling the number of discretization steps at regular intervals. Combined with hyperparameter tuning, these yield one-step FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64×64 (3.5× and 4× better than prior consistency training), with two-step sampling further improving to 2.24 and 2.77, surpassing distillation baselines in both settings.
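
To make the "3.5× and 4×" figures concrete, the short calculation below backs out the baseline FID values those factors imply. It uses only the numbers quoted in the summary; because the factors are rounded, the results are approximate and are not the baselines as reported in the paper's tables.

```python
# (reported one-step FID, stated improvement factor), both quoted from the summary.
reported = {"CIFAR-10": (2.51, 3.5), "ImageNet 64x64": (3.25, 4.0)}

# FID is lower-is-better, so an "N x improvement" means baseline FID is about N times the new FID.
implied_baseline = {name: fid * factor for name, (fid, factor) in reported.items()}
```

Under these rounded factors the implied prior consistency-training FIDs are roughly 8.8 on CIFAR-10 and 13.0 on ImageNet 64×64.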

Significance. If the empirical gains hold under the proposed modifications, the work meaningfully advances one-step generative modeling by demonstrating that consistency models can reach competitive quality directly from data. The theoretical identification of the EMA issue in the teacher model provides useful grounding. Direct comparisons to baselines and the reported FID improvements support the central claims, though the role of hyperparameter tuning requires clearer isolation for full attribution.

major comments (1)
  1. [Abstract and §4 (Experiments)] The results are qualified as arising from the proposed modifications 'combined with better hyperparameter tuning.' Without explicit ablations that apply equivalent hyperparameter search and tuning effort to the original consistency-training baseline (keeping EMA, LPIPS, etc.), it remains unclear whether the 3.5×/4× FID reductions are primarily driven by removing EMA, Pseudo-Huber loss, the lognormal schedule, and discretization doubling, or largely by the tuning itself. This attribution is load-bearing for the paper's central contribution claim.
minor comments (2)
  1. [§3.2] The precise parameterization of the lognormal noise schedule (mean and variance) should be stated explicitly alongside the discretization doubling interval to aid reproducibility.
  2. [Table 1 and Figure 2] Ensure all baseline FID numbers are obtained under identical evaluation protocols (e.g., same number of samples and classifier-free guidance settings) for fair comparison.
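
The first minor comment concerns how a lognormal noise schedule is parameterized. One way to sketch the idea, under stated assumptions: place noise levels on a fixed grid, then weight each interval of the grid by the probability mass a lognormal distribution assigns to it. The grid formula and every numeric value below (`sigma_min`, `sigma_max`, `rho`, `p_mean`, `p_std`, `n`) are illustrative assumptions, not values taken from this page, and the helper names are hypothetical.

```python
import math
import random

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """n increasing noise levels from sigma_min to sigma_max (assumed grid shape)."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(lo + i / (n - 1) * (hi - lo)) ** rho for i in range(n)]

def lognormal_weights(sigmas, p_mean, p_std):
    """Normalized lognormal probability mass on each interval [sigma_i, sigma_{i+1}]."""
    def cdf_term(s):
        # erf term of the normal CDF evaluated at log(s).
        return math.erf((math.log(s) - p_mean) / (math.sqrt(2.0) * p_std))
    raw = [cdf_term(b) - cdf_term(a) for a, b in zip(sigmas[:-1], sigmas[1:])]
    total = sum(raw)
    return [w / total for w in raw]

sigmas = karras_sigmas(n=11)
weights = lognormal_weights(sigmas, p_mean=-1.1, p_std=2.0)

# Draw five interval indices for training; seeded so the sketch is repeatable.
rng = random.Random(0)
indices = rng.choices(range(len(weights)), weights=weights, k=5)
```

The mean and standard deviation of the underlying normal (`p_mean`, `p_std`) are exactly the two quantities the referee asks the authors to state.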

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comment point-by-point below and describe the revisions we will make to strengthen the attribution of results.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The results are qualified as arising from the proposed modifications 'combined with better hyperparameter tuning.' Without explicit ablations that apply equivalent hyperparameter search and tuning effort to the original consistency-training baseline (keeping EMA, LPIPS, etc.), it remains unclear whether the 3.5×/4× FID reductions are primarily driven by removing EMA, Pseudo-Huber loss, the lognormal schedule, and discretization doubling, or largely by the tuning itself. This attribution is load-bearing for the paper's central contribution claim.

    Authors: We agree that the current presentation leaves room for ambiguity in attributing the FID gains to the proposed techniques versus hyperparameter tuning. The baseline numbers are taken directly from the original consistency training paper using the hyperparameters reported therein. Our improvements include both algorithmic changes (EMA removal to fix the identified flaw, Pseudo-Huber loss, lognormal schedule, and doubling of discretization steps) and more extensive tuning. In the revised manuscript we will add explicit ablation experiments that re-train the original consistency training baseline (retaining EMA, LPIPS, etc.) under an equivalent hyperparameter search budget. These new results will be reported in §4 alongside the existing tables to better isolate the contribution of each change. The theoretical analysis in §3 already shows why EMA removal addresses a fundamental inconsistency in the teacher model; the additional ablations will provide the requested empirical separation.

    Revision: yes

Circularity Check

0 steps flagged

Minor self-citation to prior consistency model work but central claims remain empirically independent

Full rationale

The paper identifies a flaw in prior consistency training (EMA in the teacher model) and introduces new components: Pseudo-Huber loss, lognormal noise schedule, doubled discretization steps, and hyperparameter tuning. These yield reported FID improvements on CIFAR-10 and ImageNet 64x64, validated against external baselines. No derivation reduces a prediction to a fitted input by construction, nor does any uniqueness theorem or ansatz smuggle in prior results as forced. Self-citation to the authors' earlier consistency models paper provides the baseline for comparison but is not load-bearing for the new empirical modifications or results. The chain is self-contained with independent content.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions from robust statistics for the loss function and generative modeling for the training setup, while introducing several new tunable elements in the training procedure.

free parameters (2)
  • lognormal noise schedule parameters
    Parameters controlling the lognormal distribution for the consistency training noise schedule, chosen or tuned during development.
  • discretization doubling interval
    The fixed number of training iterations after which the total discretization steps are doubled.
axioms (1)
  • domain assumption: There exists a previously overlooked flaw in consistency training stemming from the use of exponential moving average in the teacher model.
    Identified via theoretical analysis and addressed by its removal.
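
The second free parameter, the discretization doubling interval, can be pictured with a small sketch. It assumes a step count that starts at some initial value, doubles every fixed number of training iterations, and is capped at a maximum. The function name, the cap, and all numeric values are illustrative assumptions rather than the paper's settings.

```python
def discretization_steps(iteration, doubling_interval, initial_steps, max_steps):
    """Discretization steps at a given training iteration, doubled every doubling_interval."""
    doublings = iteration // doubling_interval
    return min(initial_steps * 2 ** doublings, max_steps)

# Hypothetical settings: start at 10 steps, double every 1000 iterations, cap at 1280.
schedule = [
    discretization_steps(k, doubling_interval=1000, initial_steps=10, max_steps=1280)
    for k in (0, 999, 1000, 3500, 7000, 50000)
]
# schedule == [10, 10, 20, 80, 1280, 1280]
```

Training therefore begins on a coarse grid and moves to progressively finer ones, with the doubling interval controlling how quickly that happens.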




Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One-Step Generative Modeling via Wasserstein Gradient Flows

    cs.LG 2026-05 conditional novelty 7.0

    W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...

  2. From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

    cs.CV 2026-04 unverdicted novelty 7.0

    CoEdit is a zero-shot coopetitive framework for text-guided image editing that uses dual-entropy attention manipulation and entropic latent refinement to improve editing harmony and structural preservation.

  3. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  4. One Step Diffusion via Shortcut Models

    cs.LG 2024-10 conditional novelty 7.0

    Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.

  5. DCFold: Efficient Protein Structure Generation with Single Forward Pass

    cs.LG 2026-05 unverdicted novelty 6.0

    DCFold achieves AlphaFold3-level protein structure prediction accuracy in a single forward pass using Dual Consistency training and a Temporal Geodesic Matching scheduler, delivering 15x inference acceleration.

  6. Thermal-Only Crowd Counting with Deployment-Time Privacy Protection

    cs.CV 2026-05 unverdicted novelty 6.0

    A privacy-preserving thermal-only crowd counting framework extracts enhanced features from thermal images via single-step LCM denoising in a depth-to-RGB diffusion model and matches RGB-T fusion performance without RG...

  7. DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.

  8. Efficient Image Synthesis with Sphere Latent Encoder

    cs.CV 2026-05 unverdicted novelty 6.0

    Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.

  9. FlashMol: High-Quality Molecule Generation in as Few as Four Steps

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...

  10. Physical Fidelity Reconstruction via Improved Consistency-Distilled Flow Matching for Dynamical Systems

    cs.LG 2026-05 unverdicted novelty 6.0

    Distilled one-step consistency model from optimal-transport flow-matching teacher reconstructs high-fidelity dynamical system flows from low-fidelity data with 12x speedup, half the parameters, and 23.1% better SSIM t...

  11. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  12. Efficient Diffusion Distillation via Embedding Loss

    cs.CV 2026-04 unverdicted novelty 6.0

    Embedding Loss aligns feature distributions via MMD in random network embeddings to boost one-step diffusion distillation, reaching SOTA FID of 1.475 on CIFAR-10 unconditional generation.

  13. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  14. Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    JFDL allows pre-trained Consistency Models to perform guided image generation post-hoc by aligning flow distributions, reducing FID scores on CIFAR-10 and ImageNet without needing a teacher model.

  15. Dual-End Consistency Model

    cs.CV 2026-02 unverdicted novelty 6.0

    DE-CM reaches state-of-the-art one-step FID of 1.70 on ImageNet 256x256 by decomposing PF-ODE trajectories into three critical sub-trajectories and using flow matching plus N2N mapping for stability.

  16. Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    cs.CV 2026-02 conditional novelty 6.0

    Causal Forcing initializes autoregressive diffusion students from AR teachers to recover flow maps that bidirectional teachers cannot provide, delivering 19%+ gains over Self Forcing on dynamic degree and related metrics.

  17. Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

    cs.CV 2025-10 conditional novelty 6.0

    The work introduces rCM, a score-regularized continuous-time consistency model that matches DMD2 quality on large models up to 14B parameters while improving diversity and enabling 1-4 step sampling.

  18. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  19. Variance Reduction for Expectations with Diffusion Teachers

    cs.LG 2026-05 unverdicted novelty 5.0

    CARV introduces a hierarchical Monte Carlo estimator with amortized reuse, importance sampling, and stratification that yields 2-3x effective compute gains on diffusion-teacher pipelines while cutting gradient varianc...

  20. Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations

    cs.CV 2026-05 unverdicted novelty 5.0

    A simplified one-step diffusion distillation uses pretrained teacher features directly for drifting loss plus a mode coverage term, achieving FID 1.58 on ImageNet-64 and 18.4 on SDXL.

  21. SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation

    cs.LG 2026-04 unverdicted novelty 5.0

    SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.

  22. Discrete Meanflow Training Curriculum

    cs.LG 2026-04 unverdicted novelty 4.0

    A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
