Generative Modeling by Minimizing the Wasserstein-2 Loss

Yu-Jui Huang; Zachariah Malik

arxiv: 2406.13619 · v4 · pith:2E3BXI4Rnew · submitted 2024-06-19 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-23 23:51 UTC · model grok-4.3

classification: stat.ML, cs.LG

keywords: Wasserstein distance, generative modeling, gradient flow, Kantorovich potential, ordinary differential equation, optimal transport, persistent training

The pith

The time-marginal laws of a distribution-dependent ODE form a gradient flow for the Wasserstein-2 loss that converges exponentially to the data distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a generative model that minimizes the second-order Wasserstein loss by evolving samples according to a distribution-dependent ordinary differential equation. The ODE uses the Kantorovich potential between the current distribution estimate and the true data distribution to set its dynamics. A main theorem establishes that the marginal distributions of solutions to this ODE follow the gradient flow of the W2 loss and approach the target distribution at an exponential rate. The authors then discretize the ODE via an Euler scheme, prove that the scheme recovers the continuous gradient flow in the limit, and build a practical algorithm that incorporates persistent training to approximate the potentials on the fly.

Core claim

The time-marginal laws of the distribution-dependent ODE form a gradient flow for the W2 loss, which converges exponentially to the true data distribution. An Euler scheme for the ODE recovers the gradient flow in the limit.

What carries the argument

The distribution-dependent ordinary differential equation whose vector field is defined by the Kantorovich potential between the current estimate and the true data distribution.
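As a rough guide to the structure described here, the dynamics and the decay rate can be written out as below. This is a formal sketch rather than the paper's statement: the symbols $\mu_d$ (data distribution), $\mu_t$ (law of $Y_t$), $\varphi_{\mu}^{\mu_d}$ (a Kantorovich potential from $\mu$ to $\mu_d$ for the cost $\tfrac12|x-y|^2$), and the normalization $J(\mu)=\tfrac12 W_2^2(\mu,\mu_d)$ are notation assumed for illustration, and the computation takes for granted enough regularity to differentiate along the flow.

```latex
\begin{align*}
  \frac{dY_t}{dt} &= -\nabla\varphi_{\mu_t}^{\mu_d}(Y_t), \qquad \mu_t = \mathrm{Law}(Y_t), \\
  \partial_t \mu_t &= \nabla\cdot\bigl(\mu_t\,\nabla\varphi_{\mu_t}^{\mu_d}\bigr), \\
  \frac{d}{dt} J(\mu_t) &= -\int \bigl|\nabla\varphi_{\mu_t}^{\mu_d}\bigr|^2\,d\mu_t
      = -W_2^2(\mu_t,\mu_d) = -2J(\mu_t), \\
  W_2(\mu_t,\mu_d) &= e^{-t}\,W_2(\mu_0,\mu_d).
\end{align*}
```

Under this normalization the rate constant is 1; a different scaling of the loss would rescale time and hence the constant.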

If this is right

The Euler scheme converges to the continuous gradient flow of the W2 loss as the step size tends to zero.
Persistent training in the resulting algorithm produces samples whose quality exceeds that of Wasserstein GANs in both low- and high-dimensional settings, provided the level of persistent training is increased appropriately.
The method supplies an explicit continuous-time dynamics for directly minimizing the W2 loss without an adversarial objective.
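The Euler scheme mentioned above can be illustrated in one dimension, where no potential needs to be learned. This is a minimal sketch under stated assumptions, not the paper's algorithm: for two equal-size samples on the line the optimal map is the monotone (sorted) matching, and the gradient of the potential is taken as x - T(x), which corresponds to the cost ½|x - y|². The function names, sample sizes, and step size are hypothetical.

```python
import math
import random


def w2_empirical_1d(xs, ys):
    """W2 distance between two equal-size empirical measures on the line.

    In 1D the optimal coupling matches sorted samples to sorted samples.
    """
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    sq = sum((x - y) ** 2 for x, y in zip(xs_sorted, ys_sorted))
    return math.sqrt(sq / len(xs))


def euler_step(xs, ys, step):
    """One Euler step x <- x - step * (x - T(x)).

    T is the monotone matching: the particle of rank r is sent to the
    data point of rank r. This stands in for the gradient of a learned
    Kantorovich potential (assumption made for this sketch).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ys_sorted = sorted(ys)
    new_xs = list(xs)
    for rank, i in enumerate(order):
        target = ys_sorted[rank]
        new_xs[i] = xs[i] - step * (xs[i] - target)
    return new_xs


random.seed(0)
n = 200
particles = [random.gauss(0.0, 1.0) for _ in range(n)]  # initial estimate
data = [random.gauss(3.0, 0.5) for _ in range(n)]       # stand-in for data
step = 0.1

history = [w2_empirical_1d(particles, data)]
for _ in range(50):
    particles = euler_step(particles, data, step)
    history.append(w2_empirical_1d(particles, data))

print(history[0], history[-1])
```

In this toy setting each step contracts the empirical W2 distance by the factor (1 - step), the discrete counterpart of exponential decay. In higher dimensions the sorted matching is unavailable, which is where an approximation of the potential would come in.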

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Replacing the W2 cost with other optimal-transport distances would require only a change in the dual potential used to drive the ODE.
The explicit gradient-flow structure may avoid some of the training instabilities that arise in min-max formulations of generative models.
Scalability in high dimensions hinges on fast, accurate approximation of the Kantorovich potential at each step of the discretization.

Load-bearing premise

The Kantorovich potential associated with the true data distribution and the current estimate exists and can be used to define the dynamics of the distribution-dependent ODE.

What would settle it

Numerical simulation of the proposed ODE or its Euler discretization that shows the W2 distance to the data distribution fails to decrease exponentially, or that the generated marginal fails to approach the target, would falsify the central convergence claim.
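One way such a check could look is sketched below for a case with a closed form. The assumptions are labeled: both distributions are one-dimensional Gaussians, for which the optimal map is affine and W2 is the Euclidean distance between the (mean, standard deviation) pairs, and the loss is normalized as ½W2² so that the predicted continuous-time rate is 1. All function names and parameter values are hypothetical.

```python
import math


def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form W2 between N(m1, s1^2) and N(m2, s2^2)."""
    return math.hypot(m1 - m2, s1 - s2)


def simulate(m0, s0, m_d, s_d, step, n_steps):
    """Euler steps of the flow restricted to 1D Gaussians.

    With the affine optimal map T(x) = m_d + (s_d / s) * (x - m), the
    velocity -(x - T(x)) moves the mean and the standard deviation
    linearly toward their targets.
    """
    m, s = m0, s0
    times = [0.0]
    losses = [w2_gaussian_1d(m, s, m_d, s_d)]
    for k in range(1, n_steps + 1):
        m -= step * (m - m_d)
        s -= step * (s - s_d)
        times.append(k * step)
        losses.append(w2_gaussian_1d(m, s, m_d, s_d))
    return times, losses


def fit_log_slope(times, losses):
    """Least-squares slope of log(loss) against time."""
    logs = [math.log(v) for v in losses]
    n = len(times)
    t_bar = sum(times) / n
    l_bar = sum(logs) / n
    num = sum((t - t_bar) * (l - l_bar) for t, l in zip(times, logs))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


times, losses = simulate(0.0, 1.0, 3.0, 0.5, step=0.01, n_steps=300)
slope = fit_log_slope(times, losses)
print(slope)  # close to -1 when the decay is exponential with rate 1
```

A fitted slope far from -1, or a log-loss curve that is visibly not linear in time, would be the kind of evidence this section has in mind. For learned potentials the same diagnostic applies, with W2 replaced by whatever estimate is available.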

Figures

Figures reproduced from arXiv: 2406.13619 by Yu-Jui Huang, Zachariah Malik.

**Figure 2.** The first (resp. second) row plots the Wasserstein-1 (resp. Wasserstein-2) loss against …

**Figure 3.** 1-NN classifier accuracy against training epoch for domain adaptation from the USPS …

Original abstract

This paper develops a generative model by minimizing the second-order Wasserstein loss (the $W_2$ loss) through a distribution-dependent ordinary differential equation (ODE), whose dynamics involves the Kantorovich potential associated with the true data distribution and a current estimate of it. A main result shows that the time-marginal laws of the ODE form a gradient flow for the $W_2$ loss, which converges exponentially to the true data distribution. An Euler scheme for the ODE is proposed and it is shown to recover the gradient flow for the $W_2$ loss in the limit. An algorithm is designed by following the scheme and applying persistent training, which naturally fits our gradient-flow approach. In both low- and high-dimensional experiments, our algorithm outperforms Wasserstein generative adversarial networks by increasing the level of persistent training appropriately.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a distribution-dependent ODE whose marginals are claimed to be the W2 gradient flow with exponential convergence, plus an Euler scheme and persistent-training algorithm, but the ODE well-posedness rests on missing regularity assumptions.

The new element is the specific ODE driven by the Kantorovich potential between the running marginal and the fixed target measure, together with the statement that its time-marginals realize the W2 gradient flow and converge exponentially. They then discretize via Euler and turn the scheme into a persistent-training algorithm that they test against WGANs in low- and high-dimensional settings. That construction is not standard in the literature they cite, so the algorithmic route is genuinely different from the usual adversarial or score-based approaches. The experiments are presented as showing better performance once persistent training is increased, which at least gives a concrete implementation to evaluate.

The central theoretical claim, however, needs the Kantorovich potential to exist and to produce a velocity field that makes the ODE well-posed. Brenier gives existence only when at least one measure is absolutely continuous, and the resulting map is monotone but not necessarily Lipschitz, so uniqueness of flows is not automatic. The abstract states no conditions on the measures that would guarantee this, and the stress-test concern therefore lands directly on the main result. Without those assumptions the exponential convergence statement is not yet on firm ground.

The paper is aimed at people working on Wasserstein generative models who want to avoid adversarial training. It is coherent on its own terms and engages the relevant gradient-flow literature, so it deserves a serious referee who can check whether the full proofs close the regularity gap and whether the experiments actually quantify the claimed improvement. I would send it to review but would ask the authors to add explicit conditions on the measures and to supply the missing error analysis for the Euler limit.
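For reference, the standard form of Brenier's theorem that this concern appeals to can be stated as below. The notation ($\mu$, $\nu$, $\psi$, $\varphi$, $T$) is chosen here for illustration, with the cost $\tfrac12|x-y|^2$; it is not taken from the paper.

```latex
\text{Let } \mu,\nu \in \mathcal{P}_2(\mathbb{R}^d) \text{ with } \mu \ll \mathcal{L}^d.
\text{ Then there is a convex function } \psi \text{ such that}
\[
  T = \nabla\psi \ \text{ is the } \mu\text{-a.e. unique optimal map from } \mu \text{ to } \nu,
  \qquad
  \varphi(x) = \tfrac12|x|^2 - \psi(x),
  \qquad
  \nabla\varphi(x) = x - T(x) \quad \mu\text{-a.e.}
\]
```

Convexity of $\psi$ makes $T$ monotone and differentiable almost everywhere, but it gives no Lipschitz bound on $x \mapsto x - T(x)$, which is what a classical uniqueness argument for the flow would use.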

Referee Report

1 major / 0 minor

Summary. The paper develops a generative model minimizing the W2 loss via a distribution-dependent ODE whose velocity is the negative gradient of the Kantorovich potential between the current marginal and the fixed target measure. It claims that the time-marginal laws of solutions to this ODE form a gradient flow for the W2 loss and converge exponentially to the data distribution. An Euler scheme for the ODE is shown to recover the gradient flow in the continuum limit. An algorithm is derived from the scheme using persistent training and is reported to outperform WGANs on low- and high-dimensional experiments.

Significance. If the main convergence result holds under appropriate conditions, the work would supply a non-adversarial, transport-based generative procedure with an explicit link to Wasserstein gradient flows and exponential convergence, together with a practical discretization that appears competitive in experiments. The absence of stated regularity assumptions on the measures, however, leaves the central theoretical claim unsupported in its current form.

major comments (1)

[Abstract / Main Result] Abstract and the statement of the main result: the claim that the time-marginal laws of the distribution-dependent ODE form a W2 gradient flow and converge exponentially requires the ODE to be well-posed. No assumptions are given ensuring existence of the Kantorovich potential (Brenier’s theorem needs at least one measure absolutely continuous) or sufficient regularity of its gradient to guarantee unique flows and the gradient-flow identity. This is load-bearing for the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for explicit regularity assumptions to support the well-posedness of the distribution-dependent ODE. We agree that the central claims require such conditions and will revise the manuscript to state them clearly.

Point-by-point responses

Referee: [Abstract / Main Result] Abstract and the statement of the main result: the claim that the time-marginal laws of the distribution-dependent ODE form a W2 gradient flow and converge exponentially requires the ODE to be well-posed. No assumptions are given ensuring existence of the Kantorovich potential (Brenier’s theorem needs at least one measure absolutely continuous) or sufficient regularity of its gradient to guarantee unique flows and the gradient-flow identity. This is load-bearing for the central claim.

Authors: We agree that the well-posedness of the ODE and the gradient-flow identity require explicit assumptions. By Brenier’s theorem, existence of the Kantorovich potential (unique up to constants) holds when at least one measure is absolutely continuous. To ensure unique flows and the required regularity of the velocity field, we will additionally assume that the potential is C^1 with Lipschitz gradient (or, alternatively, that the measures satisfy suitable moment and density bounds guaranteeing this). In the revision we will insert these hypotheses into the statement of the main theorem, add a short discussion of their necessity and sufficiency, and verify that the Euler-scheme convergence argument continues to hold under them. These additions do not change the algorithmic contribution or the experimental results.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is the standard Wasserstein gradient-flow construction

full rationale

The abstract defines an ODE whose velocity field is given by the Kantorovich potential between the current marginal and the target measure, then states that the resulting marginal flow satisfies the W2 gradient-flow equation. This is the canonical definition of the Wasserstein gradient flow of the squared-distance functional in the space of probability measures; the claimed property follows directly from the construction rather than from any independent derivation that could be circular. No self-citation load-bearing steps, fitted-input predictions, or ansatz smuggling appear in the provided text. The paper is therefore self-contained against external benchmarks in optimal transport, yielding a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Assessment is limited to the abstract; the central construction rests on the existence of Kantorovich potentials and standard properties of Wasserstein space.

axioms (1)

domain assumption: Kantorovich potentials exist for the true data distribution and the evolving model distribution. The ODE dynamics are defined using these potentials.

pith-pipeline@v0.9.0 · 5666 in / 1192 out tokens · 24837 ms · 2026-05-23T23:51:46.884807+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: "time-marginal laws of the ODE form a gradient flow for the W2 loss, which converges exponentially to the true data distribution"

IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: "Kantorovich potential associated with the true data distribution"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Saddle Points Toward Global Minima: A Newton-Type Method on Wasserstein Space
math.OC · 2026-05 · unverdicted · novelty 7.0

Introduces WSFN, a Newton-type method on Wasserstein space that escapes saddle points in polynomial time and achieves linear convergence to global minimizers under benign landscape assumptions.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1] L. Ambrosio, N. Gigli, and G. Savaré, Gradient flows in metric spaces and in the space of probability measures, Lectures in Mathematics ETH Zürich, Birkhäuser Verlag, Basel, second ed., 2008.

[2] M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein generative adversarial networks, in Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh, eds., vol. 70 of Proceedings of Machine Learning Research, PMLR, 06–11 Aug 2017, pp. 214–223.

[3] V. Barbu and M. Röckner, From nonlinear Fokker-Planck equations to solutions of distribution dependent SDE, Ann. Probab., 48 (2020), pp. 1902–1920.

[4] R. Carmona and F. Delarue, Probabilistic theory of mean field games with applications. I, Springer, Cham, 2018.

[5] M. Fischetti, I. Mandatelli, and D. Salvagnin, Faster SGD training by minibatch persistency, CoRR, abs/1806.07353 (2018).

[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems 27, 2014, pp. 2672–2680.

[7] I. J. Goodfellow, NIPS 2016 tutorial: Generative adversarial networks, (2016). Available at https://arxiv.org/abs/1701.00160

[8] Y.-J. Huang, S.-C. Lin, Y.-C. Huang, K.-H. Lyu, H.-H. Shen, and W.-Y. Lin, On characterizing optimal Wasserstein GAN solutions for non-Gaussian data, in 2023 IEEE International Symposium on Information Theory (ISIT), 2023, pp. 909–914.

[9] Y.-J. Huang and Y. Zhang, GANs as gradient flows that converge, Journal of Machine Learning Research, 24 (2023), pp. 1–40.

[10] I. Karatzas and S. E. Shreve, Brownian motion and stochastic calculus, vol. 113 of Graduate Texts in Mathematics, Springer-Verlag, New York, second ed., 1991.

[11] J. Leygonie, J. She, A. Almahairi, S. Rajeswar, and A. C. Courville, Adversarial computation of optimal transport maps, CoRR, abs/1906.09691 (2019).

[12] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, Unrolled generative adversarial networks, in International Conference on Learning Representations, 2017.

[13] H. Petzka, A. Fischer, and D. Lukovnikov, On the regularization of Wasserstein GANs, in International Conference on Learning Representations, 2018.

[14] F. Santambrogio, Optimal transport for applied mathematicians, vol. 87 of Progress in Nonlinear Differential Equations and their Applications, Birkhäuser/Springer, Cham, 2015. Calculus of variations, PDEs, and modeling.

[15] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel, Large scale optimal transport and mapping estimation, in International Conference on Learning Representations, 2018.

[16] D. Trevisan, Well-posedness of multidimensional diffusion processes with weakly differentiable coefficients, Electron. J. Probab., 21 (2016), Paper No. 22, 41 pp.

[17] C. Villani, Optimal transport, vol. 338 of Grundlehren der mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], Springer-Verlag, Berlin, 2009. Old and new.
