On Variance Reduction in Learning Mean Flows
Pith reviewed 2026-05-12 05:08 UTC · model grok-4.3
The pith
Correcting the coefficient on the conditional velocity field stabilizes MeanFlow training and improves sample quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that the pathology of MeanFlow training originates from an incorrect coefficient multiplying the conditional velocity field inside the loss. This field simultaneously provides the regression target and serves as a Monte Carlo control variate inside the Jacobian-vector product; the original loss assigns the wrong weight to the control-variate term. The authors derive the optimal coefficient in closed form and show that several concurrent stabilization techniques are different practical realizations of this same optimum. Coefficient sweeps on two-dimensional benchmarks and a latent Diffusion Transformer recover the predicted bias-variance ordering.
What carries the argument
The closed-form optimal coefficient that correctly weights the conditional velocity field when it functions as a control variate in the Jacobian-vector product term of the loss.
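For orientation, the classical scalar control-variate result that this kind of argument builds on can be written as below. The symbols are illustrative assumptions, not the paper's notation: $Y$ is the quantity being estimated, $Z$ is a correlated variate whose mean is known, and $c$ is the coefficient. The paper's own closed form is not reproduced in this review and may differ in detail.

```latex
\hat{Y}_c = Y - c\,\bigl(Z - \mathbb{E}[Z]\bigr), \qquad
\operatorname{Var}\bigl(\hat{Y}_c\bigr) = \operatorname{Var}(Y) - 2c\,\operatorname{Cov}(Y,Z) + c^{2}\operatorname{Var}(Z), \qquad
c^{\star} = \frac{\operatorname{Cov}(Y,Z)}{\operatorname{Var}(Z)}, \qquad
\operatorname{Var}\bigl(\hat{Y}_{c^{\star}}\bigr) = \bigl(1 - \rho_{YZ}^{2}\bigr)\operatorname{Var}(Y).
```

The variance is a convex quadratic in $c$, so any coefficient other than $c^{\star}$, including a fixed default such as $c = 1$, gives up some of the achievable variance reduction.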
Load-bearing premise
The conditional velocity field simultaneously acts as an unbiased regression target and as a Monte Carlo control variate whose coefficient in the loss must be chosen separately from the regression term.
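The separation of roles described above can be illustrated with a toy scalar problem. The sketch below is not the MeanFlow loss: `control_variate_demo`, the linear data model, and the constants in it are hypothetical choices made only to show that the variance-minimizing coefficient for a control variate need not equal a default weight of 1.

```python
import random
import statistics

def control_variate_demo(n=20000, seed=0):
    """Toy scalar illustration of a control-variate coefficient (assumed setup)."""
    rng = random.Random(seed)
    # Z is the control variate with known mean 0; Y is the noisy quantity of interest.
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [2.0 + 1.5 * zi + rng.gauss(0.0, 0.5) for zi in z]

    mean_y = statistics.fmean(y)
    mean_z = statistics.fmean(z)
    cov_yz = sum((yi - mean_y) * (zi - mean_z) for yi, zi in zip(y, z)) / (n - 1)
    var_z = statistics.variance(z)
    c_star = cov_yz / var_z  # sample estimate of Cov(Y, Z) / Var(Z)

    def per_sample_variance(c):
        # Variance of Y - c * (Z - E[Z]), using the known value E[Z] = 0.
        return statistics.variance([yi - c * zi for yi, zi in zip(y, z)])

    return c_star, {c: per_sample_variance(c) for c in (0.0, 1.0, c_star)}

c_star, variances = control_variate_demo()
```

In this toy model the default weight `c = 1.0` already reduces variance relative to `c = 0.0`, but the estimated optimum reduces it further, which is the qualitative pattern the premise relies on.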
What would settle it
If training with the derived optimal coefficient on the reported two-dimensional benchmarks fails to produce both lower gradient variance and higher sample quality than the original MeanFlow loss, the attribution of the instability to the mis-specified coefficient would be falsified.
Original abstract
One-step generative modeling has emerged as a leading approach to amortize the inference cost of diffusion and flow-matching models. Among distillation-free methods, MeanFlow training is notoriously unstable, with non-decreasing loss and unbounded gradient variance. In this work, we establish a theory that attributes this pathology to a misuse of the conditional velocity field: it plays two distinct statistical roles in the loss, both as an unbiased regression target and as a Monte Carlo control variate inside a Jacobi-vector product, with the original loss assigning the wrong coefficient to the latter. We derive the optimal coefficient in closed form, and show that a family of fixes in concurrent works corresponds to different practical realizations of the same optimum. A controlled sweep of this coefficient on two-dimensional benchmarks and on a latent Diffusion Transformer recovers the predicted bias-variance ordering. The optimal coefficient yields up to a %54 improvement in sample quality on two-dimensional benchmarks and a monotone FID trend at every matched-step DiT checkpoint. Crucially, the same DiT measurement also reveals a quantitative FID-MSE landscape mismatch: although gradient variance is minimized at an interior coefficient value, the coefficient that minimizes FID prefers the direct use of conditional velocity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that instability in MeanFlow training stems from an incorrect coefficient on the conditional velocity field, which simultaneously serves as an unbiased regression target and as a Monte Carlo control variate within the Jacobian-vector product. The authors derive the variance-optimal coefficient in closed form and unify several concurrent fixes as alternative realizations of the same optimum. They report that controlled coefficient sweeps on 2D benchmarks and latent DiT models recover the predicted bias-variance ordering, deliver up to 54% sample-quality gains, and produce monotone FID trends. They also explicitly note a quantitative mismatch: gradient variance is minimized at an interior coefficient, whereas FID is minimized at the boundary value corresponding to direct use of the conditional velocity.
Significance. If the derivation is sound, the work supplies a principled statistical account of MeanFlow pathology and a parameter-free correction that could stabilize one-step generative models while explaining prior heuristics. The closed-form result and the unification of concurrent methods are clear strengths. The reported empirical mismatch between the variance minimum and the FID minimum, however, weakens the causal attribution of quality gains to the proposed coefficient and suggests that unmodeled optimization or sampling dynamics may be responsible for the observed improvements.
major comments (2)
- [Abstract] The manuscript states that gradient variance is minimized at an interior coefficient, while FID is minimized at the boundary value that recovers direct conditional-velocity use. This quantitative mismatch between the quantity the theory optimizes (gradient variance) and the downstream metric (FID/sample quality) means that attributing the reported 54% improvement and monotone FID trend to the derived coefficient is not fully supported; other factors may drive the gains.
- [Theory derivation] The central derivation (presumably §3) models the conditional velocity as playing two distinct statistical roles and derives a closed-form coefficient that corrects only the control-variate role. It is unclear from the provided description whether this correction preserves unbiasedness of the regression target or introduces a new bias term; an explicit expansion of the loss and the Jacobian-vector product is needed to confirm that the optimum does not trade one source of bias for another.
minor comments (2)
- [Abstract] The abstract contains a typographical error: '%54' should read '54%'.
- [Experiments] Experimental sections should include the precise definition of the coefficient sweep range, the exact DiT checkpoint matching procedure, and raw variance/FID values (not only trends) to allow independent verification of the bias-variance ordering.
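As a rough picture of what reporting raw values for a coefficient sweep could look like, the following sketch tabulates the mean and variance of a batch estimator at each coefficient on a toy scalar problem. Everything here is an assumption for illustration: the `sweep` function, the grid, the batch size, and the data model are hypothetical and are not taken from the paper's experiments.

```python
import random
import statistics

def sweep(coefficients, n_trials=400, batch=64, seed=0):
    """Return rows of (coefficient, mean of estimates, variance of estimates)."""
    rng = random.Random(seed)
    # Draw all batches once so every coefficient is evaluated on identical samples.
    trials = []
    for _ in range(n_trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(batch)]  # control variate, E[Z] = 0
        y = [2.0 + 1.5 * zi + rng.gauss(0.0, 0.5) for zi in z]  # target, E[Y] = 2
        trials.append((statistics.fmean(y), statistics.fmean(z)))
    rows = []
    for c in coefficients:
        # Batch estimator of E[Y] with the control variate weighted by c.
        estimates = [mean_y - c * mean_z for mean_y, mean_z in trials]
        rows.append((c, statistics.fmean(estimates), statistics.variance(estimates)))
    return rows

rows = sweep([0.0, 0.5, 1.0, 1.5, 2.0])
for c, mean, var in rows:
    print(f"c={c:.2f}  mean={mean:.4f}  variance={var:.6f}")
```

Reusing the same batches across coefficients is a deliberate choice: it removes sampling noise from the comparison, so differences between rows reflect the coefficient alone.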
Simulated Authors' Rebuttal
We thank the referee for the detailed and constructive report. We address the major comments point by point below.
- Referee: [Abstract] The manuscript states that gradient variance is minimized at an interior coefficient, while FID is minimized at the boundary value that recovers direct conditional-velocity use. This quantitative mismatch between the quantity the theory optimizes (gradient variance) and the downstream metric (FID/sample quality) means that attributing the reported 54% improvement and monotone FID trend to the derived coefficient is not fully supported; other factors may drive the gains.
Authors: We explicitly document this mismatch in the manuscript to present a complete empirical picture. The theory identifies the variance-minimizing coefficient, and our controlled sweeps on 2D benchmarks and latent DiT models confirm that this choice reduces gradient variance as predicted while delivering up to 54% sample-quality gains and monotone FID trends relative to the original loss. Although the FID minimum occurs at the boundary, the interior optimum still substantially outperforms the baseline, supporting that the coefficient correction mitigates a primary source of instability. We do not claim that variance reduction is the sole driver of FID gains and agree that additional optimization or sampling dynamics may contribute.
revision: no
- Referee: [Theory derivation] The central derivation (presumably §3) models the conditional velocity as playing two distinct statistical roles and derives a closed-form coefficient that corrects only the control-variate role. It is unclear from the provided description whether this correction preserves unbiasedness of the regression target or introduces a new bias term; an explicit expansion of the loss and the Jacobian-vector product is needed to confirm that the optimum does not trade one source of bias for another.
Authors: Section 3 separates the roles: the conditional velocity remains the unbiased regression target, while the derived coefficient optimizes only its use as a Monte Carlo control variate inside the Jacobian-vector product. The modification reduces variance without changing the expectation of the estimator, so the overall gradient remains unbiased. To make this fully transparent, we will add an appendix containing the explicit expansion of the loss and the Jacobian-vector product term.
revision: yes
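The unbiasedness claim in this response has a simple generic counterpart, sketched below for a scalar control variate. The notation ($Y$, $Z$, $c$, and a stand-in $m$ for the control variate's mean) is assumed for illustration and is not the paper's; whether the MeanFlow estimator satisfies the exact-mean condition is precisely what the requested appendix would need to show.

```latex
\mathbb{E}\bigl[\,Y - c\,(Z - \mathbb{E}[Z])\,\bigr] = \mathbb{E}[Y]
\quad \text{for every } c,
\qquad\text{whereas}\qquad
\mathbb{E}\bigl[\,Y - c\,(Z - m)\,\bigr] = \mathbb{E}[Y] + c\,\bigl(m - \mathbb{E}[Z]\bigr)
\quad \text{when } m \neq \mathbb{E}[Z].
```

With an exact mean, the coefficient affects only the variance. With an inexact one, the bias grows linearly in $c$ while the variance is unchanged by $m$, so the mean-squared error picks up an extra term $c^{2}\,(m - \mathbb{E}[Z])^{2}$ and the best coefficient for error can differ from the best coefficient for variance.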
Circularity Check
Closed-form derivation of optimal coefficient is self-contained from identified dual roles
The paper identifies the conditional velocity field as playing two distinct statistical roles (unbiased regression target and Monte Carlo control variate in the Jacobian-vector product) and derives the optimal coefficient in closed form directly from this modeling choice. No step reduces the result to a fitted parameter, post-hoc data, or a self-citation chain; the derivation is presented as a first-principles analysis of the loss, with experiments serving only to validate the predicted bias-variance ordering rather than to construct the coefficient itself. The reported FID-MSE mismatch concerns empirical alignment with downstream metrics, but it does not make the mathematical derivation circular or equivalent to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: the conditional velocity field serves as an unbiased regression target in the loss
- domain assumption: the conditional velocity field serves as a Monte Carlo control variate inside the Jacobian-vector product