arXiv: 2605.09235 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI · stat.ML

On Variance Reduction in Learning Mean Flows

Pith reviewed 2026-05-12 05:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords meanflow · variance reduction · control variate · flow matching · one-step generation · generative models · diffusion transformer

The pith

Correcting the coefficient on the conditional velocity field stabilizes MeanFlow training and improves sample quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MeanFlow training for one-step generative models suffers from instability, with losses that do not decrease and gradients whose variance grows without bound. The root cause is that the conditional velocity field serves two statistical roles at once in the training objective: it is the target for a regression loss, and it also acts as a control variate inside a Monte Carlo estimate of a Jacobian-vector product. The original formulation uses an incorrect scaling for the second role. Deriving the statistically optimal scaling in closed form both explains why several recent ad-hoc fixes work and delivers measurable gains in sample quality. A controlled study on toy problems and on a Diffusion Transformer shows that the corrected coefficient yields up to 54 percent better samples on two-dimensional data and produces steadily improving FID scores across training checkpoints.

Core claim

The paper establishes that the pathology of MeanFlow training originates from an incorrect coefficient multiplying the conditional velocity field inside the loss. This field simultaneously provides the regression target and serves as a Monte Carlo control variate for the Jacobi-vector product; the original loss assigns it the wrong weight for the control-variate term. The authors derive the optimal coefficient in closed form and demonstrate that a range of concurrent stabilization techniques are merely different practical implementations of this same optimum. Empirical sweeps on two-dimensional benchmarks and latent Diffusion Transformers recover the predicted ordering of bias and variance.

What carries the argument

The closed-form optimal coefficient that correctly weights the conditional velocity field when it functions as a control variate in the Jacobian-vector product term of the loss.
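For readers unfamiliar with control variates, the following is a minimal sketch of the generic textbook technique, not the paper's closed-form result. It assumes a scalar quantity `f`, a correlated control variate `g` with known mean, and the standard variance-minimizing coefficient Cov(f, g) / Var(g). The function names and the toy target E[exp(X)] are illustrative choices and do not come from the paper.

```python
import numpy as np

def optimal_cv_coefficient(f, g):
    """Textbook variance-minimizing coefficient: Cov(f, g) / Var(g)."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    cov = np.cov(f, g, ddof=1)  # 2x2 sample covariance matrix of (f, g)
    return cov[0, 1] / cov[1, 1]

def control_variate_estimate(f, g, g_mean, beta):
    """Estimate E[f] from samples, using g (with known mean g_mean) as a control variate."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.mean(f - beta * (g - g_mean)))

# Toy problem (hypothetical): estimate E[exp(X)] for X ~ N(0, 1), true value exp(0.5).
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
f = np.exp(x)   # quantity whose mean we want
g = x           # control variate, known mean 0

beta_star = optimal_cv_coefficient(f, g)
plain = f                                # beta = 0: no control variate
corrected = f - beta_star * (g - 0.0)    # per-sample corrected values

print("variance without control variate:", plain.var())
print("variance with optimal coefficient:", corrected.var())
print("estimate:", control_variate_estimate(f, g, 0.0, beta_star))
```

Any other fixed coefficient leaves the expectation unchanged in this generic setting but gives a larger variance than the optimum, which is the sense in which a coefficient can be "wrong" for the control-variate role.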

Load-bearing premise

The conditional velocity field simultaneously acts as an unbiased regression target and as a Monte Carlo control variate whose coefficient in the loss must be chosen separately from the regression term.

What would settle it

If training with the derived optimal coefficient on the reported two-dimensional benchmarks fails to produce both lower gradient variance and higher sample quality than the original MeanFlow loss, the attribution of the instability to the mis-specified coefficient would be falsified.
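The gradient-variance half of this test can be measured with the total-variance quantity that the Figure 2 caption reports, Tr(Cov[∇θ ℓ]). The sketch below assumes the per-sample gradients have already been flattened and stacked into a 2-D array. The function name and the toy numbers are hypothetical, and the paper's own measurement protocol is not described on this page.

```python
import numpy as np

def total_gradient_variance(per_sample_grads):
    """Trace of the sample covariance of a stack of gradients.

    per_sample_grads: array-like of shape (n_samples, n_params), one
    flattened gradient per row. The trace of the covariance matrix equals
    the sum of per-parameter variances, so the full matrix is never formed.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    if g.ndim != 2 or g.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two gradient samples")
    return float(g.var(axis=0, ddof=1).sum())

# Toy check with two gradient samples over two parameters.
grads = np.array([[1.0, 2.0], [3.0, 6.0]])
print(total_gradient_variance(grads))  # 10.0
```

Running this on gradients collected under the original loss and under the corrected coefficient, at matched parameters and matched batches, would give the comparison the paragraph above calls for.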

Figures

Figures reproduced from arXiv: 2605.09235 by Juanwu Lu, Ziran Wang.

Figure 1: Spatial distribution of $\sqrt{\operatorname{Tr}(\Sigma_{v' \mid x_t})} = \sqrt{\mathbb{E}_{x_0 \mid x_t}\lVert v' \rVert^2}$ at three timesteps on a two-dimensional Gaussian mixture. Conditional variances concentrate in mode-mixing regions. In the original MeanFlow, the stop-gradient operator prevents J from being passed to the optimizer, leading to an empirical non-decreasing loss. Meanwhile, the mean-field difference vanishes at convergence ($r_\theta \to 0$) and the variance-dri…
Figure 2: Empirical total gradient variance $\operatorname{Tr}(\operatorname{Cov}[\nabla_\theta \ell_{\mathrm{MF}}])$ on six two-dimensional toy datasets with $\beta \in \{0, 0.25, 0.5, 0.75, 1\}$. The monotonic decrease of variances with respect to $\beta$ on almost every dataset aligns with the prediction in Theorem 2. …sion target $v_{\mathrm{cond}}$ unchanged, exploiting the role asymmetry identified in section 3.1. Appendix C provides details about the full loss, training algorithm, and three propos…
Figure 3: Empirical sample quality measured by sliced Wasserstein-…
Figure 4: Experiment results training DiT-B/4 on ImageNet-…
Figure 5: 100 class-conditional samples from the β = 0 baseline checkpoint at step 300k (FID 11.37). Same noise seed and class labels as figs. 6 and 7.
Figure 6: 100 class-conditional samples from the β = 0.5 checkpoint at step 300k (FID 12.51). Same noise seed and class labels as figs. 5 and 7.
Figure 7: 100 class-conditional samples from the β = 1 corner checkpoint at step 300k (FID 23.36). Same noise seed and class labels as figs. 5 and 6.
Original abstract

One-step generative modeling has emerged as a leading approach to amortize the inference cost of diffusion and flow-matching models. Among distillation-free methods, MeanFlow training is notoriously unstable, with non-decreasing loss and unbounded gradient variance. In this work, we establish a theory that attributes this pathology to a misuse of the conditional velocity field: it plays two distinct statistical roles in the loss, both as an unbiased regression target and as a Monte Carlo control variate inside a Jacobi-vector product, with the original loss assigning the wrong coefficient to the latter. We derive the optimal coefficient in closed form, and show that a family of fixes in concurrent works corresponds to different practical realizations of the same optimum. A controlled sweep of this coefficient on two-dimensional benchmarks and on a latent Diffusion Transformer recovers the predicted bias-variance ordering. The optimal coefficient yields up to a %54 improvement in sample quality on two-dimensional benchmarks and a monotone FID trend at every matched-step DiT checkpoint. Crucially, the same DiT measurement also reveals a quantitative FID-MSE landscape mismatch: although gradient variance is minimized at an interior coefficient value, the coefficient that minimizes FID prefers the direct use of conditional velocity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that instability in MeanFlow training stems from the conditional velocity field being assigned an incorrect coefficient in the loss, as it simultaneously serves as an unbiased regression target and a Monte Carlo control variate within the Jacobi-vector product. The authors derive the variance-optimal coefficient in closed form and unify several concurrent fixes as alternative realizations of the same optimum. They report that controlled coefficient sweeps on 2D benchmarks and latent DiT models recover the predicted bias-variance ordering, deliver up to 54% sample-quality gains, and produce monotone FID trends. They also explicitly note a quantitative mismatch: gradient variance is minimized at an interior coefficient, but FID is minimized at the boundary value corresponding to direct use of the conditional velocity.

Significance. If the derivation is sound, the work supplies a principled statistical account of MeanFlow pathology and a parameter-free correction that could stabilize one-step generative models while explaining prior heuristics. The closed-form result and the unification of concurrent methods are clear strengths. The reported empirical mismatch between the variance minimum and the FID minimum, however, weakens the causal attribution of quality gains to the proposed coefficient and suggests that unmodeled optimization or sampling dynamics may be responsible for the observed improvements.

major comments (2)
  1. [Abstract] The manuscript states that gradient variance is minimized at an interior coefficient while FID is minimized by the boundary value that recovers direct conditional-velocity use. This quantitative mismatch between the quantity optimized by the theory (gradient variance) and the downstream metric (FID/sample quality) means the attribution of the reported 54% improvement and monotone FID trend to the derived coefficient is not fully supported; other factors may drive the gains.
  2. [Theory derivation] The central derivation (presumably §3) models the conditional velocity as playing two distinct statistical roles and derives a closed-form coefficient that corrects only the control-variate role. It is unclear from the provided description whether this correction preserves unbiasedness of the regression target or introduces a new bias term; an explicit expansion of the loss and the Jacobi-vector product is needed to confirm that the optimum does not trade one source of bias for another.
minor comments (2)
  1. [Abstract] The abstract contains a typographical error: '%54' should read '54%'.
  2. [Experiments] Experimental sections should include the precise definition of the coefficient sweep range, the exact DiT checkpoint matching procedure, and raw variance/FID values (not only trends) to allow independent verification of the bias-variance ordering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the major comments point by point below.

Point-by-point responses
  1. Referee: [Abstract] The manuscript states that gradient variance is minimized at an interior coefficient while FID is minimized by the boundary value that recovers direct conditional-velocity use. This quantitative mismatch between the quantity optimized by the theory (gradient variance) and the downstream metric (FID/sample quality) means the attribution of the reported 54% improvement and monotone FID trend to the derived coefficient is not fully supported; other factors may drive the gains.

    Authors: We explicitly document this mismatch in the manuscript to present a complete empirical picture. The theory identifies the variance-minimizing coefficient, and our controlled sweeps on 2D benchmarks and latent DiT models confirm that this choice reduces gradient variance as predicted while delivering up to 54% sample-quality gains and monotone FID trends relative to the original loss. Although the FID minimum occurs at the boundary, the interior optimum still substantially outperforms the baseline, supporting that the coefficient correction mitigates a primary source of instability. We do not claim that variance reduction is the sole driver of FID gains and agree that additional optimization or sampling dynamics may contribute.

    Revision: no

  2. Referee: [Theory derivation] The central derivation (presumably §3) models the conditional velocity as playing two distinct statistical roles and derives a closed-form coefficient that corrects only the control-variate role. It is unclear from the provided description whether this correction preserves unbiasedness of the regression target or introduces a new bias term; an explicit expansion of the loss and the Jacobi-vector product is needed to confirm that the optimum does not trade one source of bias for another.

    Authors: Section 3 separates the roles: the conditional velocity remains the unbiased regression target, while the derived coefficient optimizes only its use as a Monte Carlo control variate inside the Jacobi-vector product. The modification is variance-reducing and does not change the expectation of the estimator, thereby preserving unbiasedness of the overall gradient. To make this fully transparent, we will add an appendix containing the explicit expansion of the loss and the Jacobi-vector product term.

    Revision: yes
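The unbiasedness argument in the second response has the same shape as the textbook control-variate identity, sketched here for a generic scalar estimator. The symbols $f$, $g$, and $\beta$ are illustrative assumptions rather than the paper's notation, and the identity requires $\mathbb{E}[g]$ to be known exactly. Whether the MeanFlow loss meets that condition is what the promised appendix would need to show.

```latex
\begin{align}
\hat{f}_\beta &= f - \beta\,\bigl(g - \mathbb{E}[g]\bigr), \\
\mathbb{E}\bigl[\hat{f}_\beta\bigr] &= \mathbb{E}[f] - \beta\,\bigl(\mathbb{E}[g] - \mathbb{E}[g]\bigr) = \mathbb{E}[f]
  \quad \text{for every } \beta, \\
\operatorname{Var}\bigl[\hat{f}_\beta\bigr] &= \operatorname{Var}[f] - 2\beta\,\operatorname{Cov}[f, g] + \beta^{2}\,\operatorname{Var}[g].
\end{align}
```

In this generic setting the coefficient enters only the variance, which is quadratic in $\beta$, so changing it cannot introduce bias as long as the centering term is exact.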

Circularity Check

0 steps flagged

Closed-form derivation of optimal coefficient is self-contained from identified dual roles

full rationale

The paper identifies the conditional velocity field as playing two distinct statistical roles (unbiased regression target and Monte Carlo control variate in the Jacobi-vector product) and derives the optimal coefficient in closed form directly from this modeling choice. No step reduces the result to a fitted parameter, post-hoc data, or self-citation chain; the derivation is presented as first-principles analysis of the loss, with experiments serving only to validate the predicted bias-variance ordering rather than to construct the coefficient itself. The reported FID-MSE mismatch concerns empirical alignment with downstream metrics but does not render the mathematical derivation circular or equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from Monte Carlo control variates and regression in generative modeling; no new free parameters or invented entities are introduced.

axioms (2)
  • [domain assumption] The conditional velocity field serves as an unbiased regression target in the loss.
    Core statistical role invoked to explain the original pathology.
  • [domain assumption] The conditional velocity field serves as a Monte Carlo control variate inside the Jacobi-vector product.
    Second statistical role, whose coefficient was misassigned in the original loss.

pith-pipeline@v0.9.0 · 5500 in / 1136 out tokens · 64776 ms · 2026-05-12T05:08:58.010667+00:00 · methodology

