Distribution Matching Distillation without Fake Score Network
Pith reviewed 2026-05-20 07:13 UTC · model grok-4.3
The pith
Flow-map generators can replace the auxiliary fake-score network in distribution matching distillation by using their own endpoint pseudo-velocity as a reverse-divergence proxy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal without an explicit auxiliary network.
What carries the argument
Generator-induced pseudo-velocity surrogate, which replaces the auxiliary fake-score estimator by using the flow-map endpoint velocity to deliver the required reverse-divergence correction.
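The substitution described above can be sketched in a few lines. This is a minimal illustration, not the paper's objective: the names `correction_direction`, `teacher_velocity`, `make_flow_map` and `pseudo_velocity` are hypothetical, the toy velocity fields are invented, and the "fake minus real" sign convention and the absence of any weighting are assumptions. It only shows the wiring, in which a callable built from the generator stands where an auxiliary fake-velocity network would otherwise be.

```python
import numpy as np

def correction_direction(x_t, t, real_velocity, fake_velocity):
    """DMD-style signal: fake-velocity estimate minus teacher (real) velocity.

    The sign and the lack of weighting are assumptions made for illustration.
    """
    return fake_velocity(x_t, t) - real_velocity(x_t, t)

def teacher_velocity(x, t):
    # Hypothetical frozen teacher: a toy linear velocity field.
    return -x

def make_flow_map(scale):
    # Hypothetical flow-map generator F_theta(x, t, s), read here as the
    # average velocity that carries x from time t to time s.
    def F_theta(x, t, s):
        return -scale * x
    return F_theta

def pseudo_velocity(F_theta):
    # Generator-induced surrogate: query the flow map at endpoint s = 0
    # instead of calling a separately trained fake-velocity network.
    return lambda x, t: F_theta(x, t, 0.0)

x_t = np.array([0.5, -1.0, 2.0])
matched = correction_direction(
    x_t, 0.7, teacher_velocity, pseudo_velocity(make_flow_map(1.0)))
mismatched = correction_direction(
    x_t, 0.7, teacher_velocity, pseudo_velocity(make_flow_map(0.5)))
```

In this toy the correction vanishes when the generator's pseudo-velocity agrees with the teacher and is nonzero otherwise, which is the behaviour any such surrogate would need to preserve.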
If this is right
- The derived objective extends DMD-style distribution matching to flow-map generators without the memory and update cost of a second network.
- Flow-map-consistent backward simulation can be added to the training loop for greater stability.
- A self-teacher variant enables the full method to train from scratch without a separate teacher model.
- FSF-DMD reaches lower FID than listed DMD2 comparisons when initialized from flow maps on ImageNet-1K 256x256.
Where Pith is reading between the lines
- The same endpoint-velocity substitution may simplify other distillation procedures that currently maintain separate score estimators for distribution correction.
- If the approximation remains reliable across noise schedules and resolutions, it could reduce the engineering effort required to deploy few-step generators on memory-constrained hardware.
- Evaluating the method on conditional or higher-dimensional generation tasks would test whether the flow-map proxy generalizes beyond the static-image setting reported.
Load-bearing premise
The flow-map structure inherently supplies a sufficiently accurate reverse-divergence signal via its endpoint pseudo-velocity without requiring an explicit auxiliary network or additional corrections that would reintroduce similar overhead.
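One way to probe this premise is a closed-form toy in which both quantities are known exactly. The sketch below assumes a 1-D Gaussian data distribution, the linear path x_t = (1 - t) x + t eps with t = 0 at the data end, and reads the endpoint pseudo-velocity as the average velocity (x_t - x_0) / t of the exact ODE flow. All of these conventions are assumptions and may differ from the paper's; the function names are hypothetical, and the result says nothing about trained networks.

```python
import numpy as np

def marginal_stats(t, m, s):
    """Mean and std of x_t = (1 - t) * x + t * eps, x ~ N(m, s^2), eps ~ N(0, 1)."""
    return (1.0 - t) * m, np.sqrt((1.0 - t) ** 2 * s ** 2 + t ** 2)

def instantaneous_velocity(x_t, t, m, s):
    """Exact marginal velocity E[eps - x | x_t] for this Gaussian path."""
    mu_t, sigma_t = marginal_stats(t, m, s)
    slope = (t - (1.0 - t) * s ** 2) / sigma_t ** 2
    return -m + slope * (x_t - mu_t)

def endpoint_average_velocity(x_t, t, m, s):
    """Exact average velocity (x_t - x_0) / t of the ODE flow from time t to 0."""
    mu_t, sigma_t = marginal_stats(t, m, s)
    # For Gaussian marginals the flow keeps the standardized coordinate fixed.
    x_0 = m + s * (x_t - mu_t) / sigma_t
    return (x_t - x_0) / t

def mean_abs_gap(t, m=1.0, s=0.5, n=10_000, seed=0):
    """Monte Carlo mean of |average velocity - instantaneous velocity| at time t."""
    rng = np.random.default_rng(seed)
    x = m + s * rng.standard_normal(n)
    eps = rng.standard_normal(n)
    x_t = (1.0 - t) * x + t * eps
    gap = (endpoint_average_velocity(x_t, t, m, s)
           - instantaneous_velocity(x_t, t, m, s))
    return float(np.mean(np.abs(gap)))

gap_small_t = mean_abs_gap(0.01)  # the two velocities nearly coincide near t = 0
gap_mid_t = mean_abs_gap(0.5)     # the gap is clearly larger at mid-range t
```

Under these assumptions the two velocities agree as t approaches 0 and separate at larger t, so a time-resolved measurement of this kind is one candidate for the bias analysis the premise calls for.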
What would settle it
Train two matched flow-map generators on the same backbone and data, one with the pseudo-velocity objective and one with an explicit fake-score network. If their final FID scores or distribution-matching metrics diverge substantially, or if the pseudo-velocity version fails to improve over the plain flow-map baseline, the proxy claim is falsified.
Original abstract
Distribution Matching Distillation (DMD) provides an effective distribution-level correction for few-step generation, while relying on an auxiliary fake-score network to track the evolving generative distribution. Recent work combines DMD-style objectives with flow-map generators to exploit both forward-divergence training and reverse-divergence correction. The fake-score estimator remains an additional component with memory and update overhead. In this work, we study whether this explicit tracker can be avoided when the generator itself has a flow-map structure. We propose Fake-Score-network-Free DMD (FSF-DMD), a DMD formulation for flow-map generators that replaces the auxiliary fake-score estimator with a generator-induced pseudo-velocity surrogate. The key observation is that the endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal. Building on this observation, we derive a practical objective, extend it with flow-map-consistent backward simulation, and introduce a self-teacher variant for training from scratch. In our ImageNet-1K $256 \times 256$ experiments, FSF-DMD improves flow-map baselines, reaches lower FID than the listed DMD2 comparisons in the flow-map-initialized setting, and remains effective under flow-matching initialization and training from scratch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FSF-DMD, a distribution-matching distillation method for flow-map generators that eliminates the auxiliary fake-score network. It replaces the fake-velocity estimator with a generator-induced endpoint pseudo-velocity surrogate to supply the reverse-divergence signal, derives a practical objective, adds flow-map-consistent backward simulation, and introduces a self-teacher variant for training from scratch. Experiments on ImageNet-1K 256x256 report FID improvements over flow-map baselines and competitive or better results than listed DMD2 comparisons under flow-map initialization, flow-matching initialization, and scratch training.
Significance. If the pseudo-velocity proxy is reliable, the work simplifies DMD-style corrections by removing memory and update overhead of a separate network, which is a practical advantage for few-step flow-based generators. The multi-initialization experimental protocol and inclusion of a from-scratch variant provide useful robustness evidence. The approach directly exploits the flow-map structure, which is a clean technical observation.
major comments (2)
- [Derivation of the objective] The central claim that the endpoint pseudo-velocity supplies a sufficiently accurate reverse-divergence signal rests on an unverified assumption that this surrogate tracks the evolving fake distribution without substantial bias. No error bound, bias analysis, or training-dynamics argument is provided to quantify the approximation quality between the pseudo-velocity and the true velocity field of the current generator distribution (see the key observation and objective derivation).
- [Experiments] Table reporting FID scores (ImageNet-1K 256x256): the improvements over DMD2 are stated for the flow-map-initialized setting, but without reported standard deviations across seeds, exact baseline re-implementation details, or an ablation isolating the pseudo-velocity surrogate from the backward-simulation component, it is difficult to attribute gains specifically to the proposed proxy.
minor comments (2)
- [Abstract] Abstract: the phrase 'reaches lower FID than the listed DMD2 comparisons' is imprecise; explicitly name the DMD2 variants and point to the corresponding table/figure for clarity.
- [Notation and preliminaries] Notation: introduce and consistently distinguish 'pseudo-velocity' from true velocity and from the flow-map velocity field at first appearance to prevent reader confusion in the objective and simulation sections.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the constructive major comments. We address each point below and have revised the manuscript accordingly to strengthen the presentation of the derivation and the experimental evidence.
- Referee: [Derivation of the objective] The central claim that the endpoint pseudo-velocity supplies a sufficiently accurate reverse-divergence signal rests on an unverified assumption that this surrogate tracks the evolving fake distribution without substantial bias. No error bound, bias analysis, or training-dynamics argument is provided to quantify the approximation quality between the pseudo-velocity and the true velocity field of the current generator distribution (see the key observation and objective derivation).
Authors: We acknowledge that the original manuscript presents the pseudo-velocity surrogate as a direct consequence of the flow-map structure without a dedicated error analysis. The derivation relies on the fact that, for a flow-map generator, the endpoint velocity is induced exactly by the generator's own forward mapping, which supplies the reverse-divergence signal by construction. While formal bounds were not derived in the submission, this alignment is consistent with standard Lipschitz assumptions on velocity fields in flow-based models. In the revised manuscript we have expanded the derivation section with a short bias discussion under these assumptions and added empirical training-dynamics plots that track the correlation between the pseudo-velocity and the evolving generator distribution. revision: yes
- Referee: [Experiments] Table reporting FID scores (ImageNet-1K 256x256): the improvements over DMD2 are stated for the flow-map-initialized setting, but without reported standard deviations across seeds, exact baseline re-implementation details, or an ablation isolating the pseudo-velocity surrogate from the backward-simulation component, it is difficult to attribute gains specifically to the proposed proxy.
Authors: We agree that additional reporting details are necessary to support attribution of the gains. The revised manuscript now includes standard deviations computed over three independent random seeds for all reported FID numbers. We have added an appendix subsection that documents the exact re-implementation of the DMD2 baselines, including optimizer settings, learning-rate schedules, and data-augmentation choices. We have also inserted a new ablation table that isolates the pseudo-velocity surrogate by comparing the full objective against a controlled variant that retains only the flow-map-consistent backward simulation. revision: yes
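The seed-level reporting described in this response can be sketched as follows. The helper name `summarize_fid`, the arm labels, and every number are illustrative placeholders, not results from the paper; the sketch only shows how a mean and a sample standard deviation over three seeds would be computed for the full objective and for a backward-simulation-only ablation arm.

```python
import statistics

def summarize_fid(fid_by_seed):
    """Return (mean, sample standard deviation) of FID over independent seeds."""
    mean = statistics.fmean(fid_by_seed)
    std = statistics.stdev(fid_by_seed)  # sample std (n - 1 denominator)
    return mean, std

# Placeholder values for illustration only; NOT taken from the paper.
runs = {
    "full objective": [10.0, 11.0, 12.0],
    "backward simulation only": [13.0, 13.5, 14.0],
}

summary = {name: summarize_fid(values) for name, values in runs.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.2f} ± {std:.2f}")
```

With only three seeds the sample standard deviation is itself a noisy estimate, so a gap between arms is more convincing when it is large relative to both spreads.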
Circularity Check
No significant circularity in the derivation chain
The paper introduces FSF-DMD by proposing that the endpoint pseudo-velocity from a flow-map generator serves as a surrogate for the fake-score network in distribution matching distillation. This is framed as a key observation leading to a derived practical objective, extended with flow-map-consistent backward simulation and a self-teacher variant. No load-bearing steps reduce by construction to fitted parameters, self-citations, or renamed inputs; the surrogate is defined from the generator's structural property rather than from the target distribution-matching result itself. The provided sections contain no self-citation chains, uniqueness theorems, or ansatz smuggling that would force the central claim. The derivation remains self-contained against the flow-map assumption, with the accuracy of the proxy treated as an empirical matter rather than a definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Flow-map generators produce an endpoint pseudo-velocity that can serve as a proxy for the reverse-divergence signal without auxiliary tracking.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: the endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation... $v_{\text{fake}}(\hat{x}_t; t) \approx F_\theta(\hat{x}_t; t, 0)$ (Eq. 12)
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: flow-map generator... semigroup property... injectivity and invertibility
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.