Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation
Pith reviewed 2026-05-20 12:11 UTC · model grok-4.3
The pith
A warm-up with discrete solutions plus trajectory alignment stabilizes MeanFlow for billion-parameter diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a temporary switch to a discrete solution during warm-up avoids training collapse caused by the stop-gradient term from an undertrained model, after which reverting to the differential solution allows further refinement, while trajectory distribution alignment as an auxiliary objective corrects the mean-seeking bias that otherwise appears under extremely few-step inference on complex target distributions.
What carries the argument
The warm-up technique that temporarily substitutes a discrete solution for the differential solution of MeanFlow, combined with trajectory distribution alignment as an auxiliary objective.
If this is right
- Distillation of 12-billion-parameter models becomes stable and outperforms earlier approaches.
- The framework generalizes without modification to 80-billion-parameter state-of-the-art models.
- Few-step sampling quality improves for text-to-image tasks with complex distributions.
- The same stabilization pattern can be reused when distilling other large diffusion models.
Where Pith is reading between the lines
- The same warm-up pattern might stabilize other velocity-based distillation objectives that rely on stop-gradient terms.
- Automatic detection of when to switch from discrete to differential could remove the need for manual warm-up schedules.
- Extending the alignment loss to video or multimodal generation tasks could accelerate those domains as well.
Load-bearing premise
That switching to a discrete target only during the early phase prevents collapse from the undertrained stop-gradient and that later trajectory alignment is sufficient to correct mean-seeking bias for complex targets.
What would settle it
A run on FLUX.1-dev without the discrete warm-up phase that diverges or produces clearly worse few-step samples than the full method, or a run without trajectory alignment that shows persistent mean-seeking artifacts on complex prompts.
Figures
read the original abstract
Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to reduce sampling steps to accelerate inference. Among them, MeanFlow has attracted considerable attention due to its concise formulation and remarkable performance. Nevertheless, the instability of its optimization objective and the ''mean-seeking bias'' have limited its applicability to distill large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique, in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids training collapse caused by the MeanFlow target containing a stop-gradient term from an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the ''mean-seeking bias'' of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model's trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework achieves superior performance compared to existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (up to 12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces modifications to the MeanFlow distillation objective for large-scale text-to-image diffusion models. It proposes a warm-up phase that temporarily replaces the differential MeanFlow solution with a discrete one to avoid collapse from stop-gradient terms of an undertrained model, then switches back for refinement. It further adds a trajectory distribution alignment auxiliary loss to mitigate mean-seeking bias under few-step sampling. The central claims are superior performance over prior distillation methods on FLUX.1-dev (up to 12B parameters) and robust generalization to the 80B-parameter HunyuanImage 3.0 model.
Significance. If the reported gains and stability at 12B–80B scale are robustly demonstrated, the work would be significant for practical deployment of distilled industrial-scale T2I models. The explicit handling of MeanFlow instabilities at these scales addresses a known barrier and could influence future distillation pipelines, provided the mechanisms are shown to generalize beyond the specific models tested.
major comments (3)
- [§3.2] §3.2 (Warm-up Strategy): The claim that replacing the differential objective with a discrete solution during warm-up prevents collapse due to the stop-gradient term from an undertrained teacher is load-bearing for the stability argument, yet the manuscript provides no direct metrics (e.g., loss curves, collapse frequency counts, or ablation deltas) comparing training dynamics with and without the switch at the 12B-parameter scale. Without such evidence, it remains unclear whether the switch is necessary or merely sufficient.
- [§3.3] §3.3 (Trajectory Distribution Alignment): The addition of the alignment term is presented as correcting mean-seeking bias for complex targets under few-step sampling, but no quantitative ablation isolates its contribution (e.g., FID or perceptual metrics with/without the term on FLUX.1-dev). This is central to the superiority claim over prior MeanFlow variants.
- [§4.2] §4.2 (Results on FLUX.1-dev): The superiority statement requires explicit numerical comparisons (FID, CLIP score, or human preference rates) against the strongest baselines with error bars or multiple seeds; the current presentation leaves open whether gains are robust or sensitive to post-hoc hyperparameter choices.
minor comments (2)
- [§3] Notation for the discrete versus differential solutions should be introduced with explicit equations early in §3 to avoid ambiguity when describing the switch.
- [Figure 2] Figure captions for training curves should include the exact hyperparameter settings and random seeds used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. The comments highlight important areas where additional evidence can strengthen the presentation of our stability and performance claims. We address each point below and have revised the manuscript accordingly to incorporate the requested analyses.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Warm-up Strategy): The claim that replacing the differential objective with a discrete solution during warm-up prevents collapse due to the stop-gradient term from an undertrained teacher is load-bearing for the stability argument, yet the manuscript provides no direct metrics (e.g., loss curves, collapse frequency counts, or ablation deltas) comparing training dynamics with and without the switch at the 12B-parameter scale. Without such evidence, it remains unclear whether the switch is necessary or merely sufficient.
Authors: We agree that direct comparative metrics at the 12B scale would provide stronger support for the necessity of the warm-up phase. In the revised manuscript we have added loss curves, collapse frequency statistics, and ablation deltas (new Figure 3 and Table 2) that compare training runs with and without the discrete warm-up on FLUX.1-dev. These results show markedly higher variance and collapse events when the differential objective is used from the start, confirming that the temporary discrete solution avoids reliance on unreliable stop-gradient signals from an undertrained model. revision: yes
-
Referee: [§3.3] §3.3 (Trajectory Distribution Alignment): The addition of the alignment term is presented as correcting mean-seeking bias for complex targets under few-step sampling, but no quantitative ablation isolates its contribution (e.g., FID or perceptual metrics with/without the term on FLUX.1-dev). This is central to the superiority claim over prior MeanFlow variants.
Authors: We acknowledge that an isolated ablation of the trajectory distribution alignment term is needed to substantiate its contribution. The revised manuscript now contains a dedicated ablation study in Section 4.3, reporting FID and CLIP scores on FLUX.1-dev both with and without the alignment auxiliary loss. Removing the term produces a measurable degradation in perceptual quality and an increase in mean-seeking artifacts under 4-step sampling, directly supporting its role in the reported gains. revision: yes
-
Referee: [§4.2] §4.2 (Results on FLUX.1-dev): The superiority statement requires explicit numerical comparisons (FID, CLIP score, or human preference rates) against the strongest baselines with error bars or multiple seeds; the current presentation leaves open whether gains are robust or sensitive to post-hoc hyperparameter choices.
Authors: We agree that statistical robustness should be demonstrated explicitly. Section 4.2 has been updated to include FID, CLIP scores, and human preference rates against the strongest baselines, now reported as means with standard deviations computed over three independent random seeds. The consistent positive deltas across seeds indicate that the improvements are robust rather than artifacts of particular hyperparameter selections. revision: yes
Circularity Check
No significant circularity in derivation or claims
full rationale
The paper describes engineering fixes (warm-up discrete solution switch and trajectory distribution alignment) to address stated instabilities and mean-seeking bias in MeanFlow for large-scale distillation. These are presented as direct responses to optimization problems without any equations, derivations, or self-referential definitions that reduce the performance claims to fitted inputs or tautologies by construction. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the abstract or described framework. The superiority claims on FLUX.1-dev and HunyuanImage rest on empirical application rather than circular reductions, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
warm-up technique, in which the original differential solution of MeanFlow is replaced by a discrete solution... switch the optimization objective back to the differential solution
-
IndisputableMonolith/Foundation/DimensionForcing.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
trajectory distribution alignment as an auxiliary objective
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
AlimamaCreative Team. Flux-turbo, 2024. A 8-step distilled lora for FLUX.1-dev model released by AlimamaCreative Team. 6, 7
work page 2024
-
[2]
HunyuanImage 3.0 Technical Report
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025. 2, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
arXiv preprint arXiv:2510.14974 (2025)
Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, and Sai Bi. pi-flow: Policy-based few- step generation via imitation distillation.arXiv preprint arXiv:2510.14974, 2025. 6, 7
-
[4]
Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. 8 Sana-sprint: One-step diffusion with continuous-time consis- tency distillation.arXiv preprint arXiv:2503.09641, 2025. 3
-
[5]
FlashAttention-2: Faster attention with better paral- lelism and work partitioning
Tri Dao. FlashAttention-2: Faster attention with better paral- lelism and work partitioning. InInternational Conference on Learning Representations (ICLR), 2024. 2
work page 2024
-
[6]
One Step Diffusion via Shortcut Models
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Mean Flows for One-step Generative Modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to- image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023. 5
work page 2023
-
[9]
Generative adversarial networks.Communi- cations of the ACM, 63(11):139–144, 2020
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communi- cations of the ACM, 63(11):139–144, 2020. 1
work page 2020
-
[10]
Xiao He, Huaao Tang, Zhijun Tu, Junchao Zhang, Kun Cheng, Hanting Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, et al. One step diffusion-based super-resolution with time-aware distillation.arXiv preprint arXiv:2408.07476,
-
[11]
Clipscore: A reference-free evaluation metric for image captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021. 6
work page 2021
-
[12]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 2, 3
work page 2020
-
[13]
Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, and Stefano Ermon. Cmt: Mid-training for efficient learning of consis- tency, mean flow, and flow map models.arXiv preprint arXiv:2509.24526, 2025. 3
-
[14]
Consistency trajectory mod- els: Learning probability flow ode trajectory of diffusion
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Mu- rata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion.arXiv preprint arXiv:2310.02279, 2023. 2
-
[15]
Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 2, 5, 6, 7
work page 2024
-
[16]
SDXL-Lightning: Progressive Adversarial Diffusion Distillation
Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl-lightning: Progressive adversarial diffusion distillation.arXiv preprint arXiv:2402.13929, 2024. 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[18]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
Lmarena leaderboard: Text-to-image, 2025
LMArena Team. Lmarena leaderboard: Text-to-image, 2025. According to the leaderboard updated in November 2025, HunyuanImage-3.0 ranked #1 in the Text-to-image generation task. 5
work page 2025
-
[20]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081, 2024. 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Learning few- step diffusion models by trajectory distribution matching
Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. Learning few-step diffusion models by trajectory dis- tribution matching.arXiv preprint arXiv:2503.06674, 2025. 3
-
[23]
DreamFusion: Text-to-3D using 2D Diffusion
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988, 2022. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-sd: Trajec- tory segmented consistency model for efficient image synthe- sis.Advances in Neural Information Processing Systems, 37: 117340–117362, 2024. 6, 7
work page 2024
-
[25]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[26]
Fast high- resolution image synthesis with latent adversarial diffusion distillation
Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high- resolution image synthesis with latent adversarial diffusion distillation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 2, 3
work page 2024
-
[27]
Adversarial diffusion distillation
Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer,
-
[28]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 1, 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[29]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023. 2
work page 2023
-
[30]
Improving and generalizing flow-based generative models with minibatch optimal transport
Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Conditional flow matching: Simulation-free dynamic optimal transport.arXiv preprint arXiv:2302.00482, 2(3), 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion.Advances in neural information processing systems, 36: 8406–8441, 2023. 3, 8
work page 2023
-
[32]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
Improved distribution matching distillation for fast image synthesis
Tianwei Yin, Micha¨el Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455– 47487, 2024. 2, 3 9
work page 2024
-
[34]
One-step diffusion with distribution matching distillation
Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024. 2, 3, 8
work page 2024
-
[35]
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency.arXiv preprint arXiv:2510.08431, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Inductive moment matching.arXiv preprint arXiv:2503.07565, 2025
Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching.arXiv preprint arXiv:2503.07565, 2025. 2
-
[37]
Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. InForty-first International Confer- ence on Machine Learning, 2024. 3 10
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.