
arxiv: 2605.19256 · v1 · pith:NZX5WW63new · submitted 2026-05-19 · 💻 cs.CV

Distribution Matching Distillation without Fake Score Network

Pith reviewed 2026-05-20 07:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords distribution matching distillation · flow-map generators · pseudo-velocity · fake-score network · few-step generation · reverse-divergence · ImageNet-1K

The pith

Flow-map generators can replace the auxiliary fake-score network in distribution matching distillation by using their own endpoint pseudo-velocity as a reverse-divergence proxy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper asks whether the extra fake-score network needed in distribution matching distillation can be removed once the generator itself follows a flow-map structure. It claims that the endpoint pseudo-velocity already produced inside the flow-map generator works as a usable proxy for the fake-velocity signal that supplies the reverse-divergence correction. If the substitution holds, training and memory costs drop because only a single network is needed while the method still performs distribution-level matching rather than pointwise losses. The authors derive a practical objective from this observation, add flow-map-consistent backward simulation, and introduce a self-teacher variant that trains from scratch. Experiments on ImageNet-1K at 256 by 256 resolution show that the resulting FSF-DMD improves flow-map baselines and reaches lower FID than the listed DMD2 comparisons in the flow-map-initialized setting.

Core claim

The endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal without an explicit auxiliary network.

What carries the argument

Generator-induced pseudo-velocity surrogate, which replaces the auxiliary fake-score estimator by using the flow-map endpoint velocity to deliver the required reverse-divergence correction.
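
One way to see how a flow map can carry a velocity signal on its own is the tangent condition from the flow-map literature: differentiating the map at its starting time recovers the velocity field it transports along. The sketch below checks this numerically on a 1-D Gaussian toy where both the flow map and the velocity are known in closed form. The interpolant convention, the `SIGMA_DATA` value, and the function names are assumptions made for illustration, and whether this derivative coincides with the paper's endpoint pseudo-velocity is also an assumption rather than something the summary confirms.

```python
import numpy as np

SIGMA_DATA = 0.5  # hypothetical data std for the toy problem


def sigma(t):
    """Std of x_t = (1 - t) * x + t * eps with x ~ N(0, SIGMA_DATA^2), eps ~ N(0, 1)."""
    return np.sqrt((1.0 - t) ** 2 * SIGMA_DATA ** 2 + t ** 2)


def velocity(x, t):
    """Closed-form marginal velocity v_t(x) = x * sigma'(t) / sigma(t)."""
    dsigma = (-(1.0 - t) * SIGMA_DATA ** 2 + t) / sigma(t)
    return x * dsigma / sigma(t)


def flow_map(x, s, t):
    """Closed-form flow map X_{s,t}(x): transports a point from time s to time t."""
    return x * sigma(t) / sigma(s)


def endpoint_velocity(x, s, h=1e-5):
    """Central finite-difference estimate of d/dt X_{s,t}(x) at t = s."""
    return (flow_map(x, s, s + h) - flow_map(x, s, s - h)) / (2.0 * h)


x = np.linspace(-2.0, 2.0, 9)
s = 0.6
max_gap = np.max(np.abs(endpoint_velocity(x, s) - velocity(x, s)))
print(f"max |endpoint derivative - velocity| = {max_gap:.2e}")
```

In this toy the velocity comes for free from the map itself, with no second model involved. A trained generator is only an approximate flow map, so the agreement shown here would hold only approximately in practice.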

If this is right

  • The derived objective extends DMD-style distribution matching to flow-map generators without the memory and update cost of a second network.
  • Flow-map-consistent backward simulation can be added to the training loop for greater stability.
  • A self-teacher variant enables the full method to train from scratch without a separate teacher model.
  • FSF-DMD reaches lower FID than listed DMD2 comparisons when initialized from flow maps on ImageNet-1K 256x256.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same endpoint-velocity substitution may simplify other distillation procedures that currently maintain separate score estimators for distribution correction.
  • If the approximation remains reliable across noise schedules and resolutions, it could reduce the engineering effort required to deploy few-step generators on memory-constrained hardware.
  • Evaluating the method on conditional or higher-dimensional generation tasks would test whether the flow-map proxy generalizes beyond the static-image setting reported.

Load-bearing premise

The flow-map structure inherently supplies a sufficiently accurate reverse-divergence signal via its endpoint pseudo-velocity without requiring an explicit auxiliary network or additional corrections that would reintroduce similar overhead.

What would settle it

Train matching flow-map generators with the pseudo-velocity objective versus an explicit fake-score network on the same backbone and data; if the final FID scores or distribution match metrics diverge substantially, or if the pseudo-velocity version fails to improve over the plain flow-map baseline, the proxy claim is falsified.
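
The two falsification conditions above can be written as a small decision rule. This is a minimal sketch: the `ArmResult` type, the function name, and the `tolerance` threshold are all hypothetical, since the text does not say how large a gap counts as substantial. The numbers in the usage lines are made-up placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ArmResult:
    """Final FID of one training arm (lower is better)."""
    name: str
    fid: float


def proxy_claim_survives(baseline, explicit, proxy, tolerance):
    """Return True if neither falsification condition from the text is met.

    `tolerance` is a hypothetical threshold for "diverge substantially";
    the source does not specify one. The gap is taken as two-sided,
    following the wording of the text literally.
    """
    improves_on_baseline = proxy.fid < baseline.fid
    close_to_explicit = abs(proxy.fid - explicit.fid) <= tolerance
    return improves_on_baseline and close_to_explicit


# Placeholder values for illustration only (not taken from the paper).
baseline = ArmResult("plain flow-map baseline", 10.0)
explicit = ArmResult("explicit fake-score network", 8.0)
proxy = ArmResult("pseudo-velocity objective", 8.2)
print(proxy_claim_survives(baseline, explicit, proxy, tolerance=0.5))
```

All three arms need the same backbone, data, and training budget for the comparison to isolate the proxy.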

Figures

Figures reproduced from arXiv: 2605.19256 by Deokyeong Lee, Jaesik Park, Youngjoong Kim.

Figure 1: Comparison between DMD2 and FSF-DMD. The distribution matching objective is computed as the discrepancy between fake and real scores. Let $s_\Phi$ be a teacher score network, $x_t$ a perturbed data sample, and $\hat{x}_t$ a perturbed generated sample. By the score-velocity connection, the same objective can be written using the corresponding teacher velocity network $v_\Phi$ (Eq. 8). (Left) Standard DMD requires a fake-score net…

Figure 2: From explicit fake-score tracking to Fake-Score-network-Free DMD (FSF-DMD). With a teacher velocity network $v_\Phi$, DMD applies a teacher–fake distributional correction $L_{\mathrm{DMD}}$ in Eq. (8), but usually estimates the fake velocity with an auxiliary network $v_\psi$ (Fig. 2a). Consistency distillation $L_{\mathrm{CD}}$ in Eq. (16) anchors the generator to the flow-map structure, but does not include this fake-side correction (Fig. 2b).…

Figure 3: FID10K over training steps (log-scale).

Figure 4: 2-step FSF-DMD samples on ImageNet-1K 256×256, flow-map-initialized.

In summary, the experiments support FSF-DMD as a distribution-matching method without an explicit fake-score network for flow-map generators. The flow-map-initialized comparison suggests that the generator-induced surrogate can provide an effective correction in the studied setting. The relaxed initialization result suggests that the method …
Abstract

Distribution Matching Distillation (DMD) provides an effective distribution-level correction for few-step generation, while relying on an auxiliary fake-score network to track the evolving generative distribution. Recent work combines DMD-style objectives with flow-map generators to exploit both forward-divergence training and reverse-divergence correction. The fake-score estimator remains an additional component with memory and update overhead. In this work, we study whether this explicit tracker can be avoided when the generator itself has a flow-map structure. We propose Fake-Score-network-Free DMD (FSF-DMD), a DMD formulation for flow-map generators that replaces the auxiliary fake-score estimator with a generator-induced pseudo-velocity surrogate. The key observation is that the endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal. Building on this observation, we derive a practical objective, extend it with flow-map-consistent backward simulation, and introduce a self-teacher variant for training from scratch. In our ImageNet-1K $256 \times 256$ experiments, FSF-DMD improves flow-map baselines, reaches lower FID than the listed DMD2 comparisons in the flow-map-initialized setting, and remains effective under flow-matching initialization and training from scratch.
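
The fake-versus-real discrepancy that DMD corrects, and the score-velocity connection mentioned in the Figure 1 caption, can be illustrated on a 1-D Gaussian toy where every quantity is available in closed form. The sketch below assumes the interpolant $x_t = (1 - t)\,x + t\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, under which $v = (-t\,s - x_t)/(1 - t)$; the paper's own convention and weighting may differ. Both distributions are hypothetical unit-variance Gaussians, and the fake score here is exact, which is precisely what a real DMD setup has to estimate.

```python
import numpy as np


def marginal_var(t):
    """Variance of x_t = (1 - t) * x + t * eps when x and eps both have unit variance."""
    return (1.0 - t) ** 2 + t ** 2


def score(x_t, t, mean):
    """Score of the time-t marginal N((1 - t) * mean, marginal_var(t))."""
    return -(x_t - (1.0 - t) * mean) / marginal_var(t)


def velocity_from_score(x_t, t, s):
    """Score-velocity connection for this interpolant: v = (-t * s - x_t) / (1 - t)."""
    return (-t * s - x_t) / (1.0 - t)


rng = np.random.default_rng(0)
t = 0.4
real_mean, fake_mean = 0.0, 1.5  # hypothetical toy distributions

# Perturb generated samples, as DMD does before evaluating both fields.
x_hat = fake_mean + rng.standard_normal(4096)
x_hat_t = (1.0 - t) * x_hat + t * rng.standard_normal(4096)

s_real = score(x_hat_t, t, real_mean)
s_fake = score(x_hat_t, t, fake_mean)
v_real = velocity_from_score(x_hat_t, t, s_real)
v_fake = velocity_from_score(x_hat_t, t, s_fake)

# The same correction expressed with scores and with velocities.
direction_score = s_real - s_fake
direction_velocity = (1.0 - t) / t * (v_fake - v_real)

identity_gap = np.max(np.abs(direction_score - direction_velocity))
mean_direction = direction_score.mean()
print(f"identity gap = {identity_gap:.2e}, mean direction = {mean_direction:+.3f}")
```

The direction is negative, so descending the reverse KL moves generated samples from the fake mean toward the real mean. The score form and the velocity form agree up to floating-point error, which is why a velocity estimate for the generator distribution is enough to drive the correction.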

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FSF-DMD, a distribution-matching distillation method for flow-map generators that eliminates the auxiliary fake-score network. It replaces the fake-velocity estimator with a generator-induced endpoint pseudo-velocity surrogate to supply the reverse-divergence signal, derives a practical objective, adds flow-map-consistent backward simulation, and introduces a self-teacher variant for training from scratch. Experiments on ImageNet-1K 256x256 report FID improvements over flow-map baselines and lower FID than the listed DMD2 comparisons under flow-map initialization; the method is also reported to remain effective under flow-matching initialization and training from scratch.

Significance. If the pseudo-velocity proxy is reliable, the work simplifies DMD-style corrections by removing memory and update overhead of a separate network, which is a practical advantage for few-step flow-based generators. The multi-initialization experimental protocol and inclusion of a from-scratch variant provide useful robustness evidence. The approach directly exploits the flow-map structure, which is a clean technical observation.

major comments (2)
  1. [Derivation of the objective] The central claim that the endpoint pseudo-velocity supplies a sufficiently accurate reverse-divergence signal rests on an unverified assumption that this surrogate tracks the evolving fake distribution without substantial bias. No error bound, bias analysis, or training-dynamics argument is provided to quantify the approximation quality between the pseudo-velocity and the true velocity field of the current generator distribution (see the key observation and objective derivation).
  2. [Experiments] Table reporting FID scores (ImageNet-1K 256x256): the improvements over DMD2 are stated for the flow-map-initialized setting, but without reported standard deviations across seeds, exact baseline re-implementation details, or an ablation isolating the pseudo-velocity surrogate from the backward-simulation component, it is difficult to attribute gains specifically to the proposed proxy.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'reaches lower FID than the listed DMD2 comparisons' is imprecise; explicitly name the DMD2 variants and point to the corresponding table/figure for clarity.
  2. [Notation and preliminaries] Notation: introduce and consistently distinguish 'pseudo-velocity' from true velocity and from the flow-map velocity field at first appearance to prevent reader confusion in the objective and simulation sections.
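
The seed-variance reporting requested in major comment 2 amounts to a few lines of bookkeeping. The sketch below is one plausible way to do it; the function names and the `k` threshold are hypothetical, and the comparison in `gap_exceeds_noise` is a crude heuristic rather than a formal significance test.

```python
import statistics


def summarize_fid(fid_per_seed):
    """Return (mean, sample standard deviation) of FID over independent seeds."""
    if len(fid_per_seed) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(fid_per_seed), statistics.stdev(fid_per_seed)


def gap_exceeds_noise(fid_a, fid_b, k=2.0):
    """Crude check: is the mean gap larger than k times the larger per-arm std?"""
    mean_a, std_a = summarize_fid(fid_a)
    mean_b, std_b = summarize_fid(fid_b)
    return abs(mean_a - mean_b) > k * max(std_a, std_b)
```

With only a handful of seeds the standard deviation is itself noisy, so a check like this is best read as a sanity filter on reported gaps.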

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive major comments. We address each point below and have revised the manuscript accordingly to strengthen the presentation of the derivation and the experimental evidence.

  1. Referee: [Derivation of the objective] The central claim that the endpoint pseudo-velocity supplies a sufficiently accurate reverse-divergence signal rests on an unverified assumption that this surrogate tracks the evolving fake distribution without substantial bias. No error bound, bias analysis, or training-dynamics argument is provided to quantify the approximation quality between the pseudo-velocity and the true velocity field of the current generator distribution (see the key observation and objective derivation).

    Authors: We acknowledge that the original manuscript presents the pseudo-velocity surrogate as a direct consequence of the flow-map structure without a dedicated error analysis. The derivation relies on the fact that, for a flow-map generator, the endpoint velocity is induced exactly by the generator's own forward mapping, which supplies the reverse-divergence signal by construction. While formal bounds were not derived in the submission, this alignment is consistent with standard Lipschitz assumptions on velocity fields in flow-based models. In the revised manuscript we have expanded the derivation section with a short bias discussion under these assumptions and added empirical training-dynamics plots that track the correlation between the pseudo-velocity and the evolving generator distribution.

    Revision: yes

  2. Referee: [Experiments] Table reporting FID scores (ImageNet-1K 256x256): the improvements over DMD2 are stated for the flow-map-initialized setting, but without reported standard deviations across seeds, exact baseline re-implementation details, or an ablation isolating the pseudo-velocity surrogate from the backward-simulation component, it is difficult to attribute gains specifically to the proposed proxy.

    Authors: We agree that additional reporting details are necessary to support attribution of the gains. The revised manuscript now includes standard deviations computed over three independent random seeds for all reported FID numbers. We have added an appendix subsection that documents the exact re-implementation of the DMD2 baselines, including optimizer settings, learning-rate schedules, and data-augmentation choices. We have also inserted a new ablation table that isolates the pseudo-velocity surrogate by comparing the full objective against a controlled variant that retains only the flow-map-consistent backward simulation.

    Revision: yes
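
A diagnostic of the kind the first response describes could compare the surrogate velocity against a reference velocity at the same points. The sketch below is a hypothetical version of such a check, not the authors' procedure: the function name, the choice of cosine similarity and relative L2 error, and the random stand-in arrays are all assumptions. In an actual run, the reference would have to come from somewhere independent, for example an explicitly trained fake-velocity network kept only for evaluation.

```python
import numpy as np


def velocity_agreement(v_surrogate, v_reference, eps=1e-12):
    """Agreement between two velocity fields evaluated at the same points.

    Both arrays have shape (batch, ...). Returns the mean cosine similarity
    and the mean relative L2 error over the batch.
    """
    a = v_surrogate.reshape(len(v_surrogate), -1)
    b = v_reference.reshape(len(v_reference), -1)
    norm_a = np.linalg.norm(a, axis=1)
    norm_b = np.linalg.norm(b, axis=1)
    cosine = np.sum(a * b, axis=1) / (norm_a * norm_b + eps)
    rel_err = np.linalg.norm(a - b, axis=1) / (norm_b + eps)
    return float(cosine.mean()), float(rel_err.mean())


rng = np.random.default_rng(0)
v_ref = rng.standard_normal((8, 3, 4, 4))               # stand-in reference field
v_sur = v_ref + 0.1 * rng.standard_normal(v_ref.shape)  # stand-in noisy surrogate
cos, err = velocity_agreement(v_sur, v_ref)
print(f"cosine = {cos:.3f}, relative error = {err:.3f}")
```

Logging both numbers over training steps would show whether the surrogate drifts away from the reference as the generator distribution evolves, which is the bias the referee asks about.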

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper introduces FSF-DMD by proposing that the endpoint pseudo-velocity from a flow-map generator serves as a surrogate for the fake-score network in distribution matching distillation. This is framed as a key observation leading to a derived practical objective, extended with flow-map-consistent backward simulation and a self-teacher variant. No load-bearing steps reduce by construction to fitted parameters, self-citations, or renamed inputs; the surrogate is defined from the generator's structural property rather than from the target distribution-matching result itself. The provided sections contain no self-citation chains, uniqueness theorems, or ansatz smuggling that would force the central claim. The derivation remains self-contained against the flow-map assumption, with the accuracy of the proxy treated as an empirical matter rather than a definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the flow-map generator structure providing a valid proxy and on the effectiveness of the derived objective plus backward simulation; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • Domain assumption: Flow-map generators produce an endpoint pseudo-velocity that can serve as a proxy for the reverse-divergence signal without auxiliary tracking.
    This is the key observation stated in the abstract that enables removal of the fake-score network.



