pith. machine review for the scientific record.

arxiv: 2512.02826 · v3 · submitted 2025-12-02 · 💻 cs.LG · cs.AI

From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity

Pith reviewed 2026-05-17 02:20 UTC · model grok-4.3

keywords flow matching · diffusion models · two-stage training · oracle velocity · memorization · generalization · velocity field

The pith

Flow-based diffusion models are trained toward a two-stage target: global navigation at early timesteps, then local refinement and memorization at later ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes the exact velocity field that flow matching aims to learn. It finds that the target splits naturally into an early phase where the model learns from a blend of all data points and a late phase where it focuses on the closest example. This explains why these models first build broad structures and later copy fine details. The insight accounts for why certain training tricks like changing timestep schedules or guidance intervals work well.

Core claim

The marginal velocity field of flow matching admits a closed-form expression. Computing this oracle target shows that flow-based models are optimized toward a two-stage objective: early on, the velocity is a mixture over data modes, promoting generalization to global layouts; later, it becomes dominated by the nearest data sample, encouraging memorization of details.
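
For concreteness, the closed form can be written out as follows. This is a sketch under assumptions not spelled out on this page: a linear interpolation path, a standard Gaussian source, and an empirical data distribution over N training points. The symbols u_t^* and γ_i(x_t, t) follow the Figure 2 caption; the superscript notation x_1^{(i)} for the i-th training point is introduced here for illustration.

```latex
% Assumed setup: x_0 is noise (t = 0), x_1 is data (t = 1), N training points x_1^{(i)}.
\begin{align}
  x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \\
  u_t^*(x_t) &= \sum_{i=1}^{N} \gamma_i(x_t, t)\,\frac{x_1^{(i)} - x_t}{1 - t}, \\
  \gamma_i(x_t, t) &=
    \frac{\exp\!\left(-\dfrac{\lVert x_t - t\,x_1^{(i)} \rVert^2}{2(1 - t)^2}\right)}
         {\sum_{j=1}^{N} \exp\!\left(-\dfrac{\lVert x_t - t\,x_1^{(j)} \rVert^2}{2(1 - t)^2}\right)}.
\end{align}
```

Under these assumptions, at t = 0 every weight equals 1/N, so the target points from x_0 toward the data mean (a mixture over all samples). As t approaches 1 the Gaussian bandwidth (1 - t) shrinks, and the weights collapse onto the single training point that minimizes the distance between x_t and t x_1^{(i)}.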

What carries the argument

The oracle velocity field, the closed-form marginal velocity target in flow matching, directly reveals the two-stage training dynamic without needing to train a network.
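
A minimal NumPy sketch of how such an oracle could be evaluated on a toy dataset. It assumes a linear path x_t = (1 - t) x_0 + t x_1 with standard Gaussian noise x_0 and an empirical data distribution; the function names `posterior_weights` and `oracle_velocity` and the three toy points are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def posterior_weights(x, data, t):
    """Softmax weights over training points for a noisy point x at time t < 1.

    Assumption: p_t(x | x_i) = N(t * x_i, (1 - t)^2 I), which follows from the
    linear path with a standard Gaussian source.
    """
    sq_dist = np.sum((x - t * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def oracle_velocity(x, data, t):
    """Posterior-weighted average of the conditional velocities (x_i - x) / (1 - t)."""
    gamma = posterior_weights(x, data, t)
    # The weights sum to one, so the weighted sum simplifies to this form.
    return (gamma @ data - x) / (1.0 - t)

# Hypothetical toy dataset with three 2-D points.
toy_data = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
x_query = np.array([0.3, -0.1])

# At t = 0 the weights are uniform, so the target points toward the data mean.
early = oracle_velocity(x_query, toy_data, t=0.0)

# Near t = 1 the weights collapse onto one sample (here index 1).
late_weights = posterior_weights(0.95 * toy_data[1], toy_data, t=0.95)
```

Nothing here requires a trained network: the whole computation is a softmax over distances to the training set, which is why the target can be inspected directly.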

If this is right

  • At early timesteps, the target promotes forming global layouts by generalizing across data modes.
  • At later timesteps, the target shifts toward memorizing fine-grained details of the nearest sample.
  • Techniques like timestep-shifted schedules align with this two-stage process to improve performance.
  • Classifier-free guidance intervals and latent space choices can be explained by the navigation-refinement split.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-stage view suggests designing architectures that handle coarse and fine scales differently at different times.
  • Training schedules could be explicitly split into navigation and refinement phases for better control.
  • Similar analysis might apply to other generative paradigms like score-based diffusion.
  • Monitoring the effective velocity during training could detect when memorization begins.
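
One way to make the last point concrete is to track the quantity plotted in Figure 2(b), the average top-1 posterior weight, as a function of the timestep. The sketch below is illustrative only: it assumes a linear path with Gaussian noise, uses random toy data in place of real latents, and the names `top1_weight` and `mean_top1_weight` are hypothetical.

```python
import numpy as np

def top1_weight(x, data, t):
    """Largest posterior weight over training points for one noisy point x."""
    sq_dist = np.sum((x - t * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    logits = logits - logits.max()
    w = np.exp(logits)
    return (w / w.sum()).max()

def mean_top1_weight(data, t, n_samples=256, seed=0):
    """Monte Carlo average of the top-1 weight over x_t = (1 - t) x_0 + t x_1."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(data), size=n_samples)
    noise = rng.standard_normal((n_samples, data.shape[1]))
    xt = (1.0 - t) * noise + t * data[idx]
    return float(np.mean([top1_weight(x, data, t) for x in xt]))

# Toy stand-in for a dataset: 64 random points in 16 dimensions.
toy_data = np.random.default_rng(1).standard_normal((64, 16))
curve = {t: mean_top1_weight(toy_data, t) for t in (0.0, 0.1, 0.3, 0.6, 0.9)}
for t, value in curve.items():
    print(f"t={t:.1f}  mean top-1 weight={value:.3f}")
```

A value near 1/N indicates a target that averages over the whole dataset, while a value near 1 indicates a target dominated by a single sample. Where the switch happens on real data depends on the dataset size and dimensionality, so the toy curve should not be read as reproducing the paper's figure.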

Load-bearing premise

The closed-form marginal velocity accurately represents the effective training signal that a practical neural network actually optimizes toward.

What would settle it

Train a network on the oracle velocity target computed exactly and check if its learned behavior matches the two-stage pattern observed in standard training.
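
Such a check needs two regression targets for the same noisy point: the per-sample CFM target (x_1 - x_0) and the oracle velocity. The sketch below builds both on toy data and measures their gap, which is the quantity described in the Figure 2(a) caption. It assumes a linear path with Gaussian noise; the function name `oracle_vs_cfm_mse` and the toy data are hypothetical.

```python
import numpy as np

def oracle_vs_cfm_mse(data, t, n_samples=512, seed=0):
    """Mean squared gap between the oracle velocity and the CFM target at time t < 1."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(data), size=n_samples)
    x1 = data[idx]
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1

    # Batched posterior weights, shape (n_samples, n_data).
    sq_dist = np.sum((xt[:, None, :] - t * data[None, :, :]) ** 2, axis=-1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    logits = logits - logits.max(axis=1, keepdims=True)
    gamma = np.exp(logits)
    gamma = gamma / gamma.sum(axis=1, keepdims=True)

    u_star = (gamma @ data - xt) / (1.0 - t)  # oracle (marginal) target
    cfm_target = x1 - x0                      # per-sample conditional target
    return float(np.mean(np.sum((u_star - cfm_target) ** 2, axis=1)))

toy_data = np.random.default_rng(1).standard_normal((64, 16))
mse_early = oracle_vs_cfm_mse(toy_data, t=0.0)
mse_late = oracle_vs_cfm_mse(toy_data, t=0.9)
print(f"early gap: {mse_early:.3f}   late gap: {mse_late:.3e}")
```

When the gap is large, regressing to the oracle and regressing to the stochastic CFM target give a network different per-sample supervision, even though the two objectives share the same minimizer. When the gap is close to zero, the two targets coincide and both point at a single training sample.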

Figures

Figures reproduced from arXiv: 2512.02826 by Haoming Liu, Hongyi Wen, Jinnuo Liu, Liuyang Bai, Shenji Wan, Yanhao Li, Yuanhe Guo, Yunkai Ji.

Figure 1. Illustration of the two stages in flow-based diffusion
Figure 2. (a) MSE between u_t^* and the CFM target (x_1 − x_0) across timesteps; (b) average top-1 posterior weight γ_i(x_t, t), showing rapid concentration after t = 0.1. Both plots reveal a clear two-stage behavior emerging in the oracle training target.
Figure 3. Plots of top-1 posterior weight under varying conditions.
Figure 4. Intermediate predictions of a LightningDiT-XL/1 [
Figure 5. Mixed sampling results with switch point
Figure 6. Qualitative results for refinement generalization. When
Figure 7. Training loss trends across timesteps. (a) Training losses
Figure 8. Analysis of model prediction trends. (a) Norm of veloc
Figure 9. Convergence of oracle loss under different latent spaces:
Figure 10. Convergence of gFID@5K when training rectified flow
Figure 12. Convergence of gFID@5K when training rectified flow
Figure 13. Quantitative results for oracle-model mixed generation.
Figure 14. Mixed sampling results with switch point
Figure 15. Intermediate predictions of a LightningDiT-XL/1 [
Figure 16. Qualitative illustration of two-stage behavior in
Original abstract

Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements. Our project page is available at: https://maps-research.github.io/from-navigation-to-refinement/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that flow-based diffusion models (via flow matching) possess an inherent two-stage training target that can be exactly characterized by analyzing the closed-form marginal velocity field v^*(x_t, t) derived from the probability path. This yields an early 'navigation' stage in which the target is a mixture of data modes (promoting global layout generalization) and a later 'refinement' stage dominated by the nearest data sample (promoting fine-grained memorization). The authors use this oracle to explain the effectiveness of timestep-shifted schedules, classifier-free guidance intervals, and latent-space design choices, while deepening understanding of memorization-generalization dynamics.

Significance. If the oracle velocity field is shown to be a faithful proxy for the effective training target experienced by practical networks, the work supplies a clean mathematical handle on why flow-based models exhibit distinct early generalization and late memorization regimes. This could guide more principled schedule design and architecture choices. The closed-form derivation itself is a strength, but its interpretive leap to observed network behavior requires further grounding to realize this significance.

major comments (2)
  1. [Oracle velocity analysis and experimental sections] The central claim that the closed-form marginal velocity v^* reveals the 'inherent' two-stage training target experienced by neural networks is load-bearing yet rests on an unquantified assumption. No experiments or analysis measure the fidelity of a trained v_θ to the oracle's mode-mixture-to-nearest-sample transition (e.g., by tracking effective guidance strength or local vs. global velocity alignment across timesteps). This gap directly affects whether the navigation/refinement distinction holds under finite capacity and SGD dynamics.
  2. [Discussion of practical techniques] The explanation of practical techniques (timestep-shifted schedules, CFG intervals) is interpretive and would be strengthened by a controlled ablation showing that altering the schedule changes the learned behavior in the precise manner predicted by the oracle two-stage structure, rather than by other factors.
minor comments (2)
  1. [Section deriving v^*] Clarify the precise definition of 'nearest data sample' dominance in the late-stage velocity field and how it is computed from the closed-form expression.
  2. [Figures showing velocity fields] Add quantitative metrics (e.g., velocity field divergence or mode-separation scores) to the figures illustrating the two-stage transition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the strength of the closed-form derivation. We address each major comment below, providing clarifications and describing revisions made to strengthen the empirical grounding of our claims.

Point-by-point responses
  1. Referee: [Oracle velocity analysis and experimental sections] The central claim that the closed-form marginal velocity v^* reveals the 'inherent' two-stage training target experienced by neural networks is load-bearing yet rests on an unquantified assumption. No experiments or analysis measure the fidelity of a trained v_θ to the oracle's mode-mixture-to-nearest-sample transition (e.g., by tracking effective guidance strength or local vs. global velocity alignment across timesteps). This gap directly affects whether the navigation/refinement distinction holds under finite capacity and SGD dynamics.

    Authors: We agree that directly quantifying how closely a trained network approximates the oracle transition is necessary to confirm the practical implications under finite capacity. The oracle v^* is the exact marginal target implied by the probability path, while training regresses to conditional velocities whose expectation yields this marginal; thus the two-stage structure is inherent to the objective itself. To address the gap, the revised manuscript includes new experiments that compute alignment metrics (cosine similarity of v_θ to the oracle's mixture component versus nearest-sample component) across timesteps on trained models. These results show the predicted transition occurs, with a modest lag consistent with capacity limits. The new analysis appears in Section 4.3 with supporting figures. revision: yes

  2. Referee: [Discussion of practical techniques] The explanation of practical techniques (timestep-shifted schedules, CFG intervals) is interpretive and would be strengthened by a controlled ablation showing that altering the schedule changes the learned behavior in the precise manner predicted by the oracle two-stage structure, rather than by other factors.

    Authors: We concur that interpretive explanations benefit from targeted ablations that isolate the effect predicted by the oracle. In the revised manuscript we add controlled experiments that vary the timestep shift parameter while holding other factors fixed, then measure the resulting change in the timestep at which global mode coverage gives way to fine-detail fidelity (using both qualitative layout metrics and quantitative memorization probes). Analogous ablations are performed for CFG application intervals. The outcomes match the oracle-derived predictions: schedules that extend the navigation regime improve generalization without harming later refinement. These results are reported in Section 5 with new figures and tables. revision: yes
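
The alignment metric mentioned in the first response could be sketched as follows. This is one hypothetical reading of it, not the authors' implementation: the "mixture" direction is taken to be the pull toward the dataset mean (the oracle's t = 0 limit), the "nearest-sample" direction is the pull toward the highest-weight training point, and the oracle itself stands in for a trained v_θ, which is not available here.

```python
import numpy as np

def posterior_weights(x, data, t):
    """Softmax weights over training points, assuming a linear path and Gaussian noise."""
    sq_dist = np.sum((x - t * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    w = np.exp(logits - logits.max())
    return w / w.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def alignment(v, x, data, t):
    """Cosine similarity of a velocity v to the mixture and nearest-sample directions."""
    gamma = posterior_weights(x, data, t)
    mix_dir = data.mean(axis=0) - x          # assumed "mixture" direction
    near_dir = data[np.argmax(gamma)] - x    # assumed "nearest-sample" direction
    return cosine(v, mix_dir), cosine(v, near_dir)

rng = np.random.default_rng(0)
toy_data = rng.standard_normal((32, 16))     # toy stand-in for training latents
report = {}
for t in (0.0, 0.9):
    x = (1.0 - t) * rng.standard_normal(16) + t * toy_data[3]
    gamma = posterior_weights(x, toy_data, t)
    v = (gamma @ toy_data - x) / (1.0 - t)   # oracle used as a stand-in for v_theta
    report[t] = alignment(v, x, toy_data, t)
    print(f"t={t:.1f}  cos(mixture)={report[t][0]:.3f}  cos(nearest)={report[t][1]:.3f}")
```

With a real network, `v` would be replaced by the model's prediction at (x, t), and the two cosines would be averaged over many noisy points per timestep.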

Circularity Check

0 steps flagged

Derivation of two-stage behavior is self-contained via closed-form oracle

Full rationale

The paper derives the marginal velocity field v^*(x_t, t) in closed form directly from the probability path of the flow matching objective and the data distribution. Analysis of this oracle then identifies the early-stage mixture-of-modes behavior and late-stage nearest-sample dominance. This is a direct mathematical computation independent of neural network capacity, optimization dynamics, or any fitted parameters. No load-bearing step reduces by construction to a self-citation, ansatz smuggled via prior work, or renaming of a known empirical pattern. The central claim therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of a closed-form marginal velocity for the flow-matching objective and the assumption that this oracle target governs practical training dynamics. No new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The flow-matching objective admits a closed-form marginal velocity field over the data distribution.
    Invoked to enable exact oracle computation without simulation.

pith-pipeline@v0.9.0 · 5517 in / 1144 out tokens · 24516 ms · 2026-05-17T02:20:22.976840+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Support-Conditioned Flow Matching Is Kernel Smoothing

    cs.LG · 2026-05 · accept · novelty 8.0

    Support-conditioned flow matching under the Gaussian OT path is exactly Nadaraya-Watson kernel smoothing with time-decreasing bandwidth, implemented by a single Gaussian attention head.

  2. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  3. Is Flow Matching Just Trajectory Replay for Sequential Data?

    stat.ML · 2026-02 · unverdicted · novelty 7.0

    Flow matching on time series targets a closed-form nonparametric velocity field that is a similarity-weighted mixture of observed transition velocities, making neural models approximations to an ideal memory-augmented...
