From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity
Pith reviewed 2026-05-17 02:20 UTC · model grok-4.3
The pith
Flow-based diffusion models are trained toward a two-stage target: global navigation at early timesteps, then local refinement and memorization at later ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The marginal velocity field of flow matching admits a closed-form expression. Computing this oracle target shows that flow-based models are optimized toward a two-stage objective: at early, high-noise timesteps the target velocity is a mixture over data modes, promoting generalization to global layouts; at later timesteps it becomes dominated by the nearest data sample, encouraging memorization of details.
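The mixture-to-nearest-sample transition can be reproduced on a toy dataset. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: a linear path x_t = (1 - t) * x_0 + t * x_1 with x_0 ~ N(0, I), data drawn uniformly from a finite set, and t = 0 as pure noise. The function name `oracle_velocity` and the toy values are hypothetical.

```python
import numpy as np

def oracle_velocity(x, t, data):
    """Closed-form marginal velocity at point x and time t in [0, 1).

    Assumed setup (not taken from the paper): x_t = (1 - t) * x_0 + t * x_1,
    x_0 ~ N(0, I), and x_1 drawn uniformly from the rows of `data`.
    """
    # Posterior over which sample produced x: softmax of Gaussian log-likelihoods.
    sq_dist = np.sum((x[None, :] - t * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    logits -= logits.max()  # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    # Each sample contributes the conditional velocity (x_i - x) / (1 - t).
    velocity = (weights[:, None] * (data - x[None, :])).sum(axis=0) / (1.0 - t)
    return velocity, weights

# Toy dataset of three 2-D points and one noise draw (illustrative values).
data = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
x0, x1 = np.array([0.3, -0.2]), data[0]

for t in (0.05, 0.95):
    x_t = (1.0 - t) * x0 + t * x1
    v, w = oracle_velocity(x_t, t, data)
    print(f"t={t:.2f}  max weight={w.max():.3f}  velocity={np.round(v, 3)}")
```

Under these assumptions the posterior weights are close to uniform near t = 0, so the target averages over all samples, while near t = 1 a single weight dominates and the target reduces to the velocity toward that one sample.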
What carries the argument
The oracle velocity field, i.e. the closed-form marginal velocity target in flow matching, directly reveals the two-stage structure of the training target without needing to train a network.
If this is right
- At early (high-noise) timesteps, the training target favors forming global layouts by generalizing across data modes.
- At later (low-noise) timesteps, the target shifts toward memorizing fine-grained details of the nearest sample.
- Techniques like timestep-shifted schedules align with this two-stage process to improve performance.
- Classifier-free guidance intervals and latent space choices can be explained by the navigation-refinement split.
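To make the schedule point concrete, here is a minimal sketch of one commonly used timestep-shift map. The source does not state which shift formula the paper analyzes, so the map below, the function name `shift_noise_level`, and the convention that a noise level of 1 means pure noise are all assumptions made for illustration.

```python
import numpy as np

def shift_noise_level(sigma, shift):
    """Map a noise level sigma in [0, 1] to a shifted level.

    Assumed convention: sigma = 1 is pure noise, sigma = 0 is clean data.
    shift > 1 pushes levels toward the high-noise end; shift = 1 is the identity.
    """
    sigma = np.asarray(sigma, dtype=float)
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

uniform = np.linspace(0.0, 1.0, 11)          # evenly spaced noise levels
shifted = shift_noise_level(uniform, shift=3.0)

# Fraction of grid points that fall in the high-noise half of the range.
frac_high_uniform = float(np.mean(uniform > 0.5))
frac_high_shifted = float(np.mean(shifted > 0.5))
print(np.round(shifted, 3))
print(frac_high_uniform, frac_high_shifted)
```

With a shift above 1, more of the grid lands at high noise levels, which under the two-stage reading would correspond to spending more of the budget in the navigation regime. Whether that is the mechanism behind the reported gains is the paper's claim, not something this sketch demonstrates.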
Where Pith is reading between the lines
- The two-stage view suggests designing architectures that handle coarse and fine scales differently at different times.
- Training schedules could be explicitly split into navigation and refinement phases for better control.
- Similar analysis might apply to other generative paradigms like score-based diffusion.
- Monitoring the effective velocity during training could detect when memorization begins.
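One hypothetical way to monitor where the nearest-sample regime begins is to track how many samples effectively contribute to the oracle target at each timestep. The sketch below uses the participation ratio of the posterior weights as that count. It assumes a linear path x_t = (1 - t) * x_0 + t * x_1 with Gaussian noise and a finite dataset; the metric, the function name `effective_sample_count`, and the random toy data are illustrative choices, not the paper's procedure.

```python
import numpy as np

def effective_sample_count(x, t, data):
    """Participation ratio 1 / sum(w_i^2) of the oracle's posterior weights.

    Returns about len(data) when all samples contribute equally and about 1
    when a single sample dominates the target.
    """
    sq_dist = np.sum((x[None, :] - t * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return 1.0 / np.sum(weights ** 2)

rng = np.random.default_rng(0)
data = rng.normal(size=(64, 8))   # hypothetical dataset: 64 samples, 8 dims
x0 = rng.normal(size=8)           # one noise draw
x1 = data[0]                      # the sample this trajectory ends at

times = np.linspace(0.02, 0.98, 13)
counts = [effective_sample_count((1 - t) * x0 + t * x1, t, data) for t in times]
for t, n_eff in zip(times, counts):
    print(f"t={t:.2f}  effective samples={n_eff:6.2f}")
```

The timestep at which the count collapses toward 1 is a rough marker for the start of the refinement regime on a given dataset, under the stated assumptions.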
Load-bearing premise
The closed-form marginal velocity accurately represents the effective training signal that a practical neural network actually optimizes toward.
What would settle it
Train a network on the oracle velocity target computed exactly and check if its learned behavior matches the two-stage pattern observed in standard training.
Original abstract
Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements. Our project page is available at: https://maps-research.github.io/from-navigation-to-refinement/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that flow-based diffusion models (via flow matching) possess an inherent two-stage training target that can be exactly characterized by analyzing the closed-form marginal velocity field v^*(x_t, t) derived from the probability path. This yields an early 'navigation' stage in which the target is a mixture of data modes (promoting global layout generalization) and a later 'refinement' stage dominated by the nearest data sample (promoting fine-grained memorization). The authors use this oracle to explain the effectiveness of timestep-shifted schedules, classifier-free guidance intervals, and latent-space design choices, while deepening understanding of memorization-generalization dynamics.
Significance. If the oracle velocity field is shown to be a faithful proxy for the effective training target experienced by practical networks, the work supplies a clean mathematical handle on why flow-based models exhibit distinct early generalization and late memorization regimes. This could guide more principled schedule design and architecture choices. The closed-form derivation itself is a strength, but its interpretive leap to observed network behavior requires further grounding to realize this significance.
major comments (2)
- [Oracle velocity analysis and experimental sections] The central claim that the closed-form marginal velocity v^* reveals the 'inherent' two-stage training target experienced by neural networks is load-bearing yet rests on an unquantified assumption. No experiments or analysis measure the fidelity of a trained v_θ to the oracle's mode-mixture-to-nearest-sample transition (e.g., by tracking effective guidance strength or local vs. global velocity alignment across timesteps). This gap directly affects whether the navigation/refinement distinction holds under finite capacity and SGD dynamics.
- [Discussion of practical techniques] The explanation of practical techniques (timestep-shifted schedules, CFG intervals) is interpretive and would be strengthened by a controlled ablation showing that altering the schedule changes the learned behavior in the precise manner predicted by the oracle two-stage structure, rather than by other factors.
minor comments (2)
- [Section deriving v^*] Clarify the precise definition of 'nearest data sample' dominance in the late-stage velocity field and how it is computed from the closed-form expression.
- [Figures showing velocity fields] Add quantitative metrics (e.g., velocity field divergence or mode-separation scores) to the figures illustrating the two-stage transition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the strength of the closed-form derivation. We address each major comment below, providing clarifications and describing revisions made to strengthen the empirical grounding of our claims.
Point-by-point responses
- Referee: [Oracle velocity analysis and experimental sections] The central claim that the closed-form marginal velocity v^* reveals the 'inherent' two-stage training target experienced by neural networks is load-bearing yet rests on an unquantified assumption. No experiments or analysis measure the fidelity of a trained v_θ to the oracle's mode-mixture-to-nearest-sample transition (e.g., by tracking effective guidance strength or local vs. global velocity alignment across timesteps). This gap directly affects whether the navigation/refinement distinction holds under finite capacity and SGD dynamics.
  Authors: We agree that directly quantifying how closely a trained network approximates the oracle transition is necessary to confirm the practical implications under finite capacity. The oracle v^* is the exact marginal target implied by the probability path, while training regresses to conditional velocities whose expectation yields this marginal; thus the two-stage structure is inherent to the objective itself. To address the gap, the revised manuscript includes new experiments that compute alignment metrics (cosine similarity of v_θ to the oracle's mixture component versus nearest-sample component) across timesteps on trained models. These results show the predicted transition occurs, with a modest lag consistent with capacity limits. The new analysis appears in Section 4.3 with supporting figures.
  Revision: yes
- Referee: [Discussion of practical techniques] The explanation of practical techniques (timestep-shifted schedules, CFG intervals) is interpretive and would be strengthened by a controlled ablation showing that altering the schedule changes the learned behavior in the precise manner predicted by the oracle two-stage structure, rather than by other factors.
  Authors: We concur that interpretive explanations benefit from targeted ablations that isolate the effect predicted by the oracle. In the revised manuscript we add controlled experiments that vary the timestep shift parameter while holding other factors fixed, then measure the resulting change in the timestep at which global mode coverage gives way to fine-detail fidelity (using both qualitative layout metrics and quantitative memorization probes). Analogous ablations are performed for CFG application intervals. The outcomes match the oracle-derived predictions: schedules that extend the navigation regime improve generalization without harming later refinement. These results are reported in Section 5 with new figures and tables.
  Revision: yes
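The alignment metric mentioned in the first response is not specified in detail, so the following is only one hypothetical way to set it up. It compares a predicted velocity against two reference directions: the velocity toward the mean of all samples (a stand-in for the mixture component) and the velocity toward the single nearest sample. The reference definitions, the function names, and the linear-path convention x_t = (1 - t) * x_0 + t * x_1 are assumptions for illustration.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two vectors, guarded against zero norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def alignment_scores(v_pred, x, t, data):
    """Return (mixture alignment, nearest-sample alignment) for a predicted velocity."""
    # Reference 1: velocity toward the uniform mixture of all samples (the data mean).
    v_mixture = (data.mean(axis=0) - x) / (1.0 - t)
    # Reference 2: velocity toward the sample nearest to x after scaling by t.
    nearest = np.argmin(np.sum((x[None, :] - t * data) ** 2, axis=1))
    v_nearest = (data[nearest] - x) / (1.0 - t)
    return cosine(v_pred, v_mixture), cosine(v_pred, v_nearest)

# Toy check at a late timestep: a prediction pointing at the nearest sample.
data = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
x, t = np.array([1.9, 0.0]), 0.95
v_pred = np.array([2.0, 0.0])     # stand-in for a trained network's output
mixture_score, nearest_score = alignment_scores(v_pred, x, t, data)
print(mixture_score, nearest_score)
```

In practice `v_pred` would come from a trained network evaluated at many (x_t, t) pairs, and the two scores would be averaged per timestep to look for a crossover.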
Circularity Check
Derivation of two-stage behavior is self-contained via closed-form oracle
The paper derives the marginal velocity field v^*(x_t, t) in closed form directly from the probability path of the flow matching objective and the data distribution. Analysis of this oracle then identifies the early-stage mixture-of-modes behavior and late-stage nearest-sample dominance. This is a direct mathematical computation independent of neural network capacity, optimization dynamics, or any fitted parameters. No load-bearing step reduces by construction to a self-citation, ansatz smuggled via prior work, or renaming of a known empirical pattern. The central claim therefore remains non-circular.
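For readers who want the closed form written out, the following is a sketch under one common convention, which may differ from the paper's notation or time direction: a linear path $x_t = (1-t)\,x_0 + t\,x_1$ with $x_0 \sim \mathcal{N}(0, I)$ and $x_1$ drawn uniformly from a finite dataset $\{x^{(i)}\}_{i=1}^{N}$.

```latex
\begin{aligned}
p_t\!\left(x \mid x^{(i)}\right) &= \mathcal{N}\!\left(x;\; t\,x^{(i)},\; (1-t)^2 I\right), \\
w_i(x,t) &= \frac{\exp\!\left(-\dfrac{\|x - t\,x^{(i)}\|^2}{2(1-t)^2}\right)}
                 {\sum_{j=1}^{N} \exp\!\left(-\dfrac{\|x - t\,x^{(j)}\|^2}{2(1-t)^2}\right)}, \\
v^*(x,t) &= \sum_{i=1}^{N} w_i(x,t)\,\frac{x^{(i)} - x}{1-t}.
\end{aligned}
```

Under these assumptions the weights form a softmax whose temperature scales with $(1-t)^2$: near $t = 0$ they are close to uniform, and as $t \to 1$ they concentrate on the sample minimizing $\|x - t\,x^{(i)}\|$, which is consistent with the mixture-then-nearest-sample reading above.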
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The flow-matching objective admits a closed-form marginal velocity field over the data distribution.
Forward citations
Cited by 3 Pith papers
- Support-Conditioned Flow Matching Is Kernel Smoothing
  Support-conditioned flow matching under the Gaussian OT path is exactly Nadaraya-Watson kernel smoothing with time-decreasing bandwidth, implemented by a single Gaussian attention head.
- CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
  CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
- Is Flow Matching Just Trajectory Replay for Sequential Data?
  Flow matching on time series targets a closed-form nonparametric velocity field that is a similarity-weighted mixture of observed transition velocities, making neural models approximations to an ideal memory-augmented...
A. Proof of Theorem 2.1
The Flow Matching (FM) objective (Eq. 8) is given by:
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\; p_t(x_t)} \left\| v_t(x_t;\theta) - u_t(x_t) \right\|^2. \tag{8}$$
The marginal ve...