One-Step Generative Modeling via Wasserstein Gradient Flows

Emmanuel J. Candès; Jiaqi Han; Puheng Li; Qiushan Guo; Renyuan Xu; Stefano Ermon

arXiv: 2605.11755 · v2 · pith:GDCQ5IT2new · submitted 2026-05-12 · 💻 cs.LG · cs.CV · stat.ML

Pith reviewed 2026-05-13 07:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML

keywords: generative modeling, Wasserstein gradient flow, one-step generation, Sinkhorn divergence, optimal transport, diffusion models, ImageNet

The pith

W-Flow achieves one-step ImageNet 256x256 generation at 1.29 FID by training a neural network to compress a Wasserstein gradient flow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces W-Flow to create a generator that maps simple reference samples to target data in a single forward pass. It first evolves the reference distribution toward the target by following a Wasserstein gradient flow that minimizes an energy functional given by the Sinkhorn divergence. A static neural network is then trained to approximate the full continuous evolution at once. This produces better mode coverage and domain transfer than prior one-step methods while delivering sampling speeds roughly 100 times faster than multi-step diffusion models with comparable quality. A reader would care because the approach replaces expensive iterative sampling with a principled, transport-based shortcut that still reaches high fidelity.

Core claim

W-Flow defines an evolution from reference to target distribution through a Wasserstein gradient flow minimizing the Sinkhorn divergence energy functional, then trains a static neural generator to realize this entire evolution in one step. The finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically the resulting model reaches 1.29 FID on one-step ImageNet 256x256 generation, improves mode coverage and domain transfer, and yields approximately 100 times faster sampling than multi-step diffusion models with similar FID scores.
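The first stage described above can be sketched at the particle level. The code below is a minimal illustration, not the paper's training procedure: it assumes uniform weights, a squared Euclidean cost, and the standard barycentric-projection form of the Sinkhorn-divergence velocity (target projection minus self projection), and it moves particles directly instead of training a neural generator. All function names, the step size, and the value of `eps` are hypothetical choices made for the example.

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log-sum-exp along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sinkhorn_plan(x, y, eps, iters=100):
    """Entropic OT plan between uniform empirical measures on x and y (log domain)."""
    n, m = len(x), len(y)
    cost = 0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_a, log_b = -np.log(n), -np.log(m)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        f = -eps * logsumexp((g[None, :] - cost) / eps + log_b, axis=1)
        g = -eps * logsumexp((f[:, None] - cost) / eps + log_a, axis=0)
    return np.exp((f[:, None] + g[None, :] - cost) / eps + log_a + log_b)

def barycentric_map(x, y, eps):
    # Plan-weighted average of the y points assigned to each x point.
    plan = sinkhorn_plan(x, y, eps)
    return (plan @ y) / plan.sum(axis=1, keepdims=True)

def sinkhorn_flow_velocity(x, y, eps):
    # Assumed debiased velocity: pull toward the target, minus the self-transport term.
    return barycentric_map(x, y, eps) - barycentric_map(x, x, eps)

def run_particle_flow(x0, y, eps=0.5, step=0.5, n_steps=30):
    # Explicit Euler steps along the assumed velocity field.
    x = x0.copy()
    for _ in range(n_steps):
        x = x + step * sinkhorn_flow_velocity(x, y, eps)
    return x

rng = np.random.default_rng(0)
reference = rng.normal(size=(64, 2))                             # toy reference samples
target = 0.5 * rng.normal(size=(64, 2)) + np.array([4.0, 0.0])   # toy target samples
evolved = run_particle_flow(reference, target)
print("mean before:", reference.mean(axis=0))
print("mean after: ", evolved.mean(axis=0))
print("target mean:", target.mean(axis=0))
```

In the paper's setting a static generator would be trained so that its one-step output matches where such an evolution ends up; that second stage is not shown here.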

What carries the argument

The Wasserstein gradient flow of the Sinkhorn divergence energy functional, compressed into a single forward pass by a static neural generator.
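For readers who want the objects written out, the following block is a sketch in standard notation. The velocity field matches the excerpt quoted later on this page; the continuity equation and the debiased form of the Sinkhorn divergence are the usual textbook definitions, and the symbols $\mathcal{F}$, $q_t$, $p$, and $\varepsilon$ are notational assumptions rather than the paper's confirmed choices.

```latex
\begin{aligned}
\partial_t q_t + \nabla \cdot \left( q_t V_t \right) &= 0,
\qquad
V_t(x) = -\nabla \frac{\delta \mathcal{F}}{\delta q}(q_t)(x), \\
\mathcal{F}(q) = S_\varepsilon(q, p)
&= \mathrm{OT}_\varepsilon(q, p)
 - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(q, q)
 - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(p, p).
\end{aligned}
```

Here $q_t$ is the evolving distribution, $p$ is the target, and $\mathrm{OT}_\varepsilon$ is entropy-regularized optimal transport with regularization strength $\varepsilon > 0$.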

Load-bearing premise

Finite-sample training dynamics converge to the continuous-time Wasserstein gradient flow dynamics under suitable assumptions.
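One ingredient of this premise, that empirical measures approach their population limit as the sample size grows, can be illustrated in one dimension, where the 2-Wasserstein distance between two equal-size samples reduces to matching sorted values. The sketch below is only a toy illustration under that assumption; it says nothing about the paper's image-space setting, and the sample sizes and trial count are arbitrary.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """2-Wasserstein distance between two equal-size 1D samples (sorted matching)."""
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((x_sorted - y_sorted) ** 2)))

rng = np.random.default_rng(0)
sizes = [50, 500, 5000]
trials = 20
avg_w2 = {}
for n in sizes:
    # Two independent samples from the same standard normal distribution.
    dists = [w2_empirical_1d(rng.normal(size=n), rng.normal(size=n))
             for _ in range(trials)]
    avg_w2[n] = float(np.mean(dists))
    print(f"N={n:5d}  average W2 between two samples: {avg_w2[n]:.4f}")
```

The average distance shrinks as N grows, which is the qualitative behavior the convergence statement needs from the sampling side; the rate in high dimensions is a separate question.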

What would settle it

A direct comparison showing that samples from the trained one-step generator deviate from the distribution reached by running the full multi-step Wasserstein flow on the same reference inputs.

Figures

Figures reproduced from arXiv: 2605.11755 by Emmanuel J. Candès, Jiaqi Han, Puheng Li, Qiushan Guo, Renyuan Xu, Stefano Ermon.

**Figure 1.** (Left) 1-NFE samples from W-Flow-L/2 trained from scratch on ImageNet 256×256. (Right) Sample quality (measured by FID) vs. effective sampling compute [39] (billion parameters × number of function evaluations during sampling) evaluated on ImageNet 256×256.

**Figure 2.** (a) The conceptual diagram of W-Flow. (b) Visualization of the training dynamics projected onto the Sinkhorn divergence landscape on 8 Gaussian mixtures, shown on a logarithmic scale.

**Figure 3.** Comparison between one-batch and two-batch estimators on learning a 2D Gaussian.

**Figure 4.** Classifier-free guidance. Left: The FID and Inception Score curve when sweeping over CFG scales. Right: Image samples by W-Flow, L/2 with CFG increasing from 0.0 to 2.0.

**Figure 5.** (a) Oval-to-circle domain transfer. Source and target are constructed by sampling angles uniformly from [0, 2π) with parametric curves corrupted by Gaussian noise. (b) & (c) One-step facial age translation on FFHQ, mapping older faces to younger ones. (b) Histogram of the latent ℓ2 distance between 2,000 source images and their generated targets. (c) Visual comparison.

**Figure 6.** Evaluation of mode coverage under imbalanced target distributions. (a) Evaluation of mode coverage on a 2D Gaussian mixtures dataset featuring six dominant modes and two distant minority modes. (b) PCA scatter plot of generated latent codes for an artificially imbalanced FFHQ target distribution (95% senior faces, 5% child faces). See Appendix F for generated samples showing the comparison of mode coverage…

**Figure 7.** Evaluation of self-transport estimators on a 2D Gaussian mixtures dataset featuring six

**Figure 8.** Comparison of velocity guidance and distribution guidance for conditional generation on a

**Figure 9.** Illustrations on the difference in the velocity field computation between Drifting Model

**Figure 10.** Uncurated samples generated by W-Flow, L/2 with CFG

**Figure 11.** Uncurated samples generated by W-Flow, XL/2

**Figure 12.** Uncurated samples generated by W-Flow, XL/2 with CFG

**Figure 13.** Uncurated samples generated by Drifting Model in the mode coverage experiment

**Figure 14.** Uncurated samples generated by W-Flow in the mode coverage experiment (Sec.

Original abstract

Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256$\times$256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100$\times$ faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

W-Flow shows how to compress a Wasserstein gradient flow into a one-step generator and hits 1.29 FID on ImageNet 256x256, but the convergence claim rests on unspecified assumptions.

Here's the quick take on this W-Flow paper: it gives a clean way to build a one-step generator by compressing a Wasserstein gradient flow trajectory, and the ImageNet results are competitive enough to notice. They define the flow using the Sinkhorn divergence as the energy functional, which gives an efficient way to measure and minimize distributional discrepancy. Then they train a neural network to perform the entire mapping from reference to target in a single forward pass. Unlike standard diffusion or continuous normalizing flow (CNF) samplers, the resulting generator needs no iterative sampling at inference time.

The paper does well on the practical side. The reported 1.29 FID for one-step ImageNet 256x256 generation, along with better mode coverage and domain transfer, shows that the method works in practice. The roughly 100x speedup over diffusion models with similar FID is a real advantage for applications needing quick sampling.

Where it gets softer is the convergence result. The claim that the finite-sample training dynamics converge to the continuous Wasserstein flow relies on "suitable assumptions" that aren't spelled out in detail. For complex image distributions, assumptions around boundedness or convergence rates might not hold without additional checks, so the theoretical guarantee feels a bit loose until the full proof is examined.

This work is for folks in generative modeling who care about inference speed without sacrificing too much quality. It deserves a serious referee because the empirical claims are specific and the core idea is well-motivated, even if the theory needs tightening on the assumptions.

Referee Report

1 major / 2 minor

Summary. The paper introduces W-Flow, a two-stage framework that first evolves samples from a reference distribution to a target data distribution via a Wasserstein gradient flow minimizing a Sinkhorn-divergence energy functional, then trains a static neural generator to compress this continuous evolution into a single forward pass. It asserts a convergence result for finite-sample training dynamics to the continuous-time flow under suitable assumptions, and reports new state-of-the-art one-step performance on ImageNet 256×256 (1.29 FID) together with improved mode coverage, domain transfer, and roughly 100× faster sampling than multi-step diffusion models of comparable FID.

Significance. If the convergence result can be made rigorous and the empirical gains hold under controlled ablations, the work would supply a principled optimal-transport route to high-fidelity one-step generation that improves upon both diffusion and existing one-step baselines in coverage and speed, with potential impact on downstream tasks requiring fast sampling.

major comments (1)

[Abstract and convergence theorem] Abstract and theoretical development: the central claim that the trained one-step generator faithfully realizes the Wasserstein flow rests on a convergence statement for finite-sample dynamics that is conditioned on unspecified 'suitable assumptions.' Because the 1.29 FID result is presented as evidence that the discrete network compresses the continuous dynamics, the precise conditions (regularity of the energy functional, Lipschitz bounds on the velocity field, uniform convergence rates of empirical measures, or control of discretization error in 256×256 image space) must be stated explicitly and shown to be satisfied; without them the link between theory and the reported FID remains unverified.

minor comments (2)

[Method section] The precise definition of the Sinkhorn-regularized energy functional and the architecture/hyper-parameters of the one-step generator should be moved from supplementary material into the main text to support reproducibility of the 1.29 FID number.
[Experiments] Figure captions and experimental tables should explicitly report the number of function evaluations and wall-clock time per sample when claiming the 100× speedup relative to diffusion baselines.
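The reporting requested in the second minor comment could look roughly like the sketch below. It uses the definition of effective sampling compute given in the Figure 1 caption (billion parameters × number of function evaluations) and a simple wall-clock harness. The samplers are hypothetical stand-ins that do trivial arithmetic, and the 0.5-billion parameter count is an invented placeholder, not a figure from the paper; only the 250-step count echoes the multi-step baselines mentioned on this page.

```python
import time

def effective_sampling_compute(params_billion, nfe):
    """Billion parameters times number of function evaluations (Figure 1 definition)."""
    return params_billion * nfe

def time_per_sample(sample_fn, batch_size, repeats=5):
    """Median wall-clock seconds per sample over several repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        sample_fn(batch_size)
        timings.append((time.perf_counter() - start) / batch_size)
    timings.sort()
    return timings[len(timings) // 2]

def make_sampler(nfe):
    # Hypothetical stand-in: one cheap update per "function evaluation".
    def sample(batch_size):
        state = [0.0] * batch_size
        for _ in range(nfe):
            state = [0.5 * s + 1.0 for s in state]
        return state
    return sample

# Placeholder model size (hypothetical); NFE of 1 vs. 250 steps.
one_step = effective_sampling_compute(params_billion=0.5, nfe=1)
multi_step = effective_sampling_compute(params_billion=0.5, nfe=250)
print("effective compute, one-step:  ", one_step)
print("effective compute, multi-step:", multi_step)
print("seconds per sample, one-step:  ", time_per_sample(make_sampler(1), 256))
print("seconds per sample, multi-step:", time_per_sample(make_sampler(250), 256))
```

Reporting both numbers matters because effective compute ignores hardware effects such as batching and memory bandwidth, while wall-clock time depends on the machine used.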

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The feedback on clarifying the convergence result is well-taken and will strengthen the manuscript. We respond point-by-point below.

Referee: [Abstract and convergence theorem] Abstract and theoretical development: the central claim that the trained one-step generator faithfully realizes the Wasserstein flow rests on a convergence statement for finite-sample dynamics that is conditioned on unspecified 'suitable assumptions.' Because the 1.29 FID result is presented as evidence that the discrete network compresses the continuous dynamics, the precise conditions (regularity of the energy functional, Lipschitz bounds on the velocity field, uniform convergence rates of empirical measures, or control of discretization error in 256×256 image space) must be stated explicitly and shown to be satisfied; without them the link between theory and the reported FID remains unverified.

Authors: We agree that the assumptions require explicit statement to make the theoretical-empirical connection rigorous. In the revision we will expand the theorem (Section 3) to list them verbatim: (i) the Sinkhorn energy is λ-convex and C²-smooth w.r.t. the 2-Wasserstein metric for ε>0; (ii) the resulting velocity field is globally L-Lipschitz; (iii) the empirical measures satisfy a uniform Glivenko–Cantelli property with rate O(n^{-1/2} log n) under the covering numbers of the RKHS induced by the kernel; (iv) the Euler–Maruyama discretization error is O(Δt) uniformly on compact time intervals when the velocity is bounded. We will add a short verification paragraph showing that (i)–(iii) hold for the entropic Sinkhorn divergence on the image manifold (citing standard OT regularity results) and that (iv) is controlled by our chosen step-size schedule. The 1.29 FID remains an empirical illustration of practical performance; the revised theorem will now make the approximation guarantee precise rather than conditional on unspecified assumptions.

revision: yes
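Assumption (iv) in the response above is a first-order discretization claim. The sketch below checks that behavior on a toy problem only: it uses deterministic forward Euler (which is what Euler–Maruyama reduces to when there is no noise term) on the hypothetical 1-Lipschitz velocity field v(x) = -x, whose exact flow is known. It is not the paper's velocity field or step-size schedule.

```python
import math

def euler_flow(x0, velocity, t_end, n_steps):
    """Integrate dx/dt = velocity(x) from 0 to t_end with explicit Euler steps."""
    dt = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + dt * velocity(x)
    return x

def velocity(x):
    # Toy 1-Lipschitz velocity field (assumption for illustration).
    return -x

exact = math.exp(-1.0)  # exact solution of dx/dt = -x at t = 1 with x(0) = 1
errors = {}
for n_steps in (10, 20, 40, 80):
    errors[n_steps] = abs(euler_flow(1.0, velocity, 1.0, n_steps) - exact)
    print(f"steps={n_steps:3d}  dt={1.0 / n_steps:.4f}  error={errors[n_steps]:.6f}")

# Halving dt should roughly halve the error for a first-order method.
ratios = [errors[n] / errors[2 * n] for n in (10, 20, 40)]
print("error ratios when dt is halved:", [round(r, 3) for r in ratios])
```

A ratio near 2 each time the step is halved is the signature of O(Δt) error; whether the same holds uniformly for the learned dynamics is what the revised theorem would need to establish.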

Circularity Check

0 steps flagged

No circularity detected in the derivation chain

full rationale

The paper defines an evolution via Wasserstein gradient flow minimizing an energy functional instantiated with Sinkhorn divergence, then trains a neural generator to compress the flow into one step. This is a standard two-stage procedure using established optimal transport geometry and neural approximation; the claimed one-step generator is optimized against the flow rather than defined to equal it by construction. The convergence of finite-sample dynamics is asserted under suitable assumptions without any equation reducing the reported FID or sampling speed directly to a fitted internal parameter. No load-bearing self-citation, uniqueness theorem imported from prior author work, or ansatz smuggled via citation appears in the provided text. The ImageNet results are presented as empirical outcomes, not forced predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a Wasserstein gradient flow for the chosen energy and on the ability of a neural network to approximate its finite-time evolution; both are standard in the literature but invoked without new justification here.

free parameters (1)

Sinkhorn regularization strength
Controls the approximation quality of the divergence and must be chosen or tuned for each dataset.
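To make the role of the regularization strength concrete, the sketch below computes the transport cost of the entropic plan for a few values of `eps`, once for a batch against itself and once for a batch against a fresh batch from the same distribution. The page only mentions one-batch and two-batch estimators in the Figure 3 caption, so the comparison here is an assumed proxy and not the paper's estimator; the cost, batch size, and `eps` grid are hypothetical. The point it illustrates is that a batch matched to itself can pair every point with itself at zero cost, which biases the self-transport term downward when `eps` is small.

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log-sum-exp along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def entropic_transport_cost(x, y, eps, iters=500):
    """Transport cost <plan, cost> of the entropic OT plan between uniform samples."""
    n, m = len(x), len(y)
    cost = 0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_a, log_b = -np.log(n), -np.log(m)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        f = -eps * logsumexp((g[None, :] - cost) / eps + log_b, axis=1)
        g = -eps * logsumexp((f[:, None] - cost) / eps + log_a, axis=0)
    plan = np.exp((f[:, None] + g[None, :] - cost) / eps + log_a + log_b)
    return float(np.sum(plan * cost))

rng = np.random.default_rng(0)
batch = rng.normal(size=(64, 2))   # one empirical batch
fresh = rng.normal(size=(64, 2))   # independent batch from the same distribution
results = {}
for eps in (0.05, 0.5, 5.0):
    one_batch = entropic_transport_cost(batch, batch, eps)
    two_batch = entropic_transport_cost(batch, fresh, eps)
    results[eps] = (one_batch, two_batch)
    print(f"eps={eps:4.2f}  one-batch self cost={one_batch:.4f}  two-batch cost={two_batch:.4f}")
```

As `eps` grows, both plans blur toward the independent coupling and the gap between the two estimates becomes less meaningful, which is one reason the parameter has to be tuned per dataset.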

axioms (1)

domain assumption: Finite-sample training dynamics converge to continuous-time distributional dynamics under suitable assumptions
Invoked to justify that the trained generator faithfully follows the flow; assumptions left unspecified in abstract.

pith-pipeline@v0.9.0 · 5541 in / 1304 out tokens · 44302 ms · 2026-05-13T07:41:20.795142+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniqueness) · unclear

We model the evolution of $\{q^{(k)}\}$ via a WGF... $V_t(x) = -\nabla \frac{\delta \mathcal{F}}{\delta q}(q_t)(x)$ ... instantiate ... Sinkhorn divergence ... prove finite-sample training dynamics converge ... under suitable assumptions (Assumption A.1)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Theorem 3.1 (Informal) ... $\sup W_2(\hat{q}^{N,M,\eta}_t, q_t) \to 0$ as $\eta \to 0$, $N, M \to \infty$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.