pith. machine review for the scientific record.

arxiv: 2603.25182 · v3 · submitted 2026-03-26 · 🧮 math.OC

Recognition: 2 theorem links · Lean Theorem

Learning Monge maps with constrained drifting models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:43 UTC · model grok-4.3

classification 🧮 math.OC
keywords optimal transport · Monge map · gradient flow · constrained optimization · neural networks · log-concave measure · natural gradient descent

The pith

A constrained gradient flow on transport maps converges to the optimal Monge map as time goes to infinity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an evolution equation for transport maps that acts as the gradient flow of a lifted, user-chosen divergence while staying inside the convex set of optimal transport maps. It proves existence of long-time solutions and convergence to the true OT map when the target measure is log-concave and the divergence meets standard convexity conditions. The flow is discretized with explicit and implicit schemes; the implicit scheme is proven to converge, and both are used to train convexity-constrained neural networks. This training is equivalent to natural gradient descent on the network parameters, and experiments show it produces more accurate maps with greater stability than ordinary Euclidean gradient descent or Adam.

Core claim

We propose a gradient flow in the space of transport maps obtained by lifting a divergence functional and constraining the flow to the convex set of Monge maps. Under standard convexity assumptions on the divergence and log-concavity of the target measure, the flow exists globally in time and converges to the optimal transport map. Time discretizations of this flow are used to train convexity-constrained neural networks parameterizing the maps, and the implicit scheme is proven to converge to the OT map.
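
A schematic rendering of the claimed flow, in notation assumed here rather than taken verbatim from the paper: ϱ0 and ϱ1 are the source and target measures, D the chosen divergence, ℱ its lift to L²ϱ₀, and 𝒞 the convex set of admissible OT maps (gradients of convex functions, by Brenier's theorem).

  \[
    \partial_t T_t \;=\; -\,P_{T_{T_t}\mathcal{C}}\!\big(\nabla_{L^2_{\varrho_0}}\mathcal{F}(T_t)\big),
    \qquad
    \mathcal{F}(T) \;=\; D\big(T_{\#}\varrho_0 \,\big\|\, \varrho_1\big),
    \qquad
    T_t \in \mathcal{C} \ \text{for all } t \ge 0,
  \]
  \[
    T_t \;\longrightarrow\; T^{\star} \ \text{ as } t \to \infty
    \quad \text{when } \varrho_1 \text{ is log-concave and } D \text{ satisfies the stated convexity conditions,}
  \]

where T⋆ is the optimal transport map from ϱ0 to ϱ1 and P denotes projection onto the tangent cone of 𝒞 at the current map; the exact form of the projection and the sense of convergence are the paper's to specify.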

What carries the argument

The constrained gradient flow of a lifted divergence on the convex set of optimal transport maps, which enforces optimality via the convexity constraint.

If this is right

  • Long-time solutions to the flow exist and it converges to the OT map under the stated conditions.
  • The implicit time-discrete scheme converges to the OT map as the time step goes to zero and iterations increase.
  • Training with the discretizations is equivalent to natural gradient descent of the lifted divergence in the neural network parameter space (a schematic update is sketched after this list).
  • The resulting maps approximate the OT map more accurately than those from Euclidean gradient descent on the same networks.
  • Training remains more stable and outperforms the Adam optimizer applied to the same convexity-constrained networks.
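
The natural-gradient reading in the third bullet above can be written out schematically; the notation here (parameterized map T_θ, Gram matrix G, step size τ) is assumed for illustration, not quoted from the paper:

  \[
    L(\theta) \;=\; \mathcal{F}(T_\theta),
    \qquad
    G(\theta)_{ij} \;=\; \big\langle \partial_{\theta_i} T_\theta,\; \partial_{\theta_j} T_\theta \big\rangle_{L^2_{\varrho_0}},
    \qquad
    \theta_{k+1} \;=\; \theta_k \;-\; \tau\, G(\theta_k)^{+}\, \nabla_\theta L(\theta_k),
  \]

where G⁺ is a pseudo-inverse. The claim is that a time discretization of the constrained flow, pulled back through θ ↦ T_θ, coincides with this L²ϱ₀-natural gradient update on the parameters (compare Figure 1 below).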

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The natural-gradient equivalence could let similar drifting-model techniques scale OT map learning to higher dimensions without explicit density estimation.
  • The method might be combined with other parameterizations beyond neural networks to handle non-log-concave targets by relaxing the constraint.
  • Links to drifting generative models suggest the flow could be hybridized with diffusion-based samplers for joint OT and sampling tasks.

Load-bearing premise

The target probability measure must be log-concave and the chosen divergence must satisfy standard convexity conditions to guarantee convergence of the flow.

What would settle it

A numerical simulation of the flow for a simple pair of source and log-concave target measures with an analytically known optimal transport map: if the evolved map fails to approach that known map as time increases, the convergence result is disproved.
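
One way to run that check in a few lines, as a sketch only: take ϱ0 = N(0, 1) and a log-concave target ϱ1 = N(m, s²), for which the optimal Monge map is the affine map T⋆(x) = s·x + m. The snippet below is not the paper's constrained-flow scheme; it does plain gradient descent on a monotone affine map (the gradient of a convex quadratic) using the closed-form KL divergence between Gaussians, and simply verifies that the learned map approaches T⋆.

# Toy falsification check (assumed setup, not the authors' implementation):
# source rho0 = N(0,1), log-concave target rho1 = N(m, s^2), known OT map T*(x) = s*x + m.
import numpy as np

m, s = 2.0, 0.5          # target N(m, s^2); any Gaussian is log-concave
a, b = 1.0, 0.0          # candidate map T(x) = a*x + b, initialized at the identity
tau = 0.05               # step size

def kl_to_target(a, b):
    # KL( N(b, a^2) || N(m, s^2) ): the pushforward of N(0,1) under T(x) = a*x + b is N(b, a^2)
    return np.log(s / a) + (a**2 + (b - m)**2) / (2 * s**2) - 0.5

for _ in range(2000):
    grad_a = -1.0 / a + a / s**2          # d KL / d a
    grad_b = (b - m) / s**2               # d KL / d b
    a, b = a - tau * grad_a, b - tau * grad_b
    a = max(a, 1e-6)                      # keep T monotone, i.e. the gradient of a convex function

print(f"learned map   T(x)  = {a:.4f} x + {b:.4f}")
print(f"optimal map   T*(x) = {s:.4f} x + {m:.4f}")
print(f"final KL to target  : {kl_to_target(a, b):.2e}")

If a run of the actual constrained flow on such a pair failed to reproduce this behaviour, the convergence theorem would be in doubt; conversely, agreement on this toy case says nothing about the harder, high-dimensional settings.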

Figures

Figures reproduced from arXiv:2603.25182 by François-Xavier Vialard, Théo Dumont, and Théo Lacombe (LIGM).

Figure 1
Figure 1. Simplified view of the geometries underlying the Euclidean (eucl.gf) and natural (nat.gf) gradient structures in Θ. (a) For the standard gradient flow, the parameter space Θ is endowed with the Euclidean metric and the optimization takes place in a curved L²ϱ₀(ℝᵈ, ℝᵈ). (b) For the L²ϱ₀-natural gradient flow, the optimization takes place in a flat L²ϱ₀(ℝᵈ, ℝᵈ) and Θ is endowed with the (non… view at source ↗
Figure 2
Figure 2. Histograms of the MMD values that result from methods (a, b, c, d). For each one, we plot and display the values of the mean and standard deviation over the 100 seeds. For method (c), values are clipped at 0.13 for the clarity of display (the mean and standard deviation are kept unchanged). (iii) Methods to be compared and their parameters. We consider the following methods to be compared. First, our two … view at source ↗
Figure 3
Figure 3. (left) Example of an unsatisfying learned OT map, obtained with method (d), with a MMD value of ≈ 0.07. (right) Example of a satisfying learned OT map, obtained with method (b), with a MMD value of ≈ 0.02. Remark 3.13 (From theoretical to numerical convergence guarantees). This numerical illustration aims at providing an elementary proof-of-concept to support the theoretical guarantees of Section 2: showin… view at source ↗
read the original abstract

We study the estimation of optimal transport (OT) maps between an arbitrary source probability measure and a log-concave target probability measure. Our contributions are twofold. First, we propose a new evolution equation in the set of transport maps. It can be seen as the gradient flow of a lift of some user-chosen divergence (e.g., the KL divergence, or relative entropy) to the space of transport maps, constrained to the convex set of optimal transport maps. We prove the existence of long-time solutions to this flow as well as its convergence toward the OT map as time goes to infinity, under standard convexity conditions on the divergence. Second, we study the practical implementation of this constrained gradient flow. We propose two time-discrete computational schemes (one explicit, one implicit), and we prove the convergence of the latter to the OT map as time goes to infinity. We then parameterize the OT maps with convexity-constrained neural networks and train them with these discretizations of the constrained gradient flow. We show that this is equivalent to performing a natural gradient descent of the lift of the chosen divergence in the neural networks' parameter space, similarly to drifting generative models. Empirically, our scheme outperforms the standard Euclidean gradient descent methods used to train convexity-constrained neural networks in terms of approximation results for the OT map and convergence stability, and it still yields better results than the same approach combined with the widely used Adam optimizer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a constrained gradient flow on the space of transport maps, defined as the gradient flow of a lifted user-chosen divergence (e.g., KL) restricted to the convex set of optimal transport maps. It proves existence of long-time solutions and convergence to the OT map as t→∞ under log-concavity of the target and standard convexity conditions on the divergence. Two time-discrete schemes (explicit and implicit) are proposed, with a convergence proof given for the implicit scheme; the maps are then parameterized by convexity-constrained neural networks, shown to be equivalent to natural gradient descent in parameter space, and demonstrated empirically to outperform Euclidean gradient descent and Adam on approximation quality and stability.

Significance. If the proofs hold, the work supplies a theoretically grounded continuous-time framework for OT map learning that directly incorporates the OT constraint and connects to drifting generative models. The explicit equivalence to natural gradient descent and the reported gains in convergence stability constitute concrete strengths, especially for high-dimensional settings where standard training of convexity-constrained networks is known to be fragile.

major comments (2)
  1. [discretization analysis] The convergence statement for the implicit scheme (abstract and the discretization section) is load-bearing for the practical claims; the manuscript must clarify whether the implicit update exactly preserves membership in the OT-map set or relies on an approximate projection, and quantify the resulting discretization error relative to the continuous flow (a schematic form of the implicit step is sketched after these comments).
  2. [neural parameterization] The neural-network parameterization is stated to be equivalent to natural gradient descent of the lifted divergence; the derivation of this equivalence (parameter-space section) should be expanded to show how the convexity constraint on the network is enforced at each step without violating the OT-map convexity requirement.
minor comments (2)
  1. [notation] Notation for the lifted divergence and its gradient flow should be introduced once with a clear reference to the underlying probability measures; repeated re-definition across sections reduces readability.
  2. [experiments] The empirical section would benefit from an additional baseline that uses the same convexity-constrained network but with a standard projected gradient step, to isolate the benefit of the constrained flow from the network architecture itself.
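
Major comment 1 concerns the implicit update. For reference, the standard minimizing-movement (proximal) form of such a step, in notation assumed here rather than quoted from the paper, is

  \[
    T_{k+1} \;\in\; \operatorname*{arg\,min}_{T \in \mathcal{C}}
    \Big\{\, \mathcal{F}(T) \;+\; \tfrac{1}{2\tau}\,\big\|T - T_k\big\|^2_{L^2_{\varrho_0}} \,\Big\},
  \]

i.e. an implicit Euler step of size τ for the constrained flow; if the minimization is genuinely carried out over the convex set of OT maps, membership in that set holds by construction at every step, which is what the rebuttal below asserts.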

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading, positive assessment of the work, and constructive suggestions. We address each major comment below and will revise the manuscript accordingly to improve clarity on the discretization and parameterization details.

read point-by-point responses
  1. Referee: The convergence statement for the implicit scheme (abstract and the discretization section) is load-bearing for the practical claims; the manuscript must clarify whether the implicit update exactly preserves membership in the OT-map set or relies on an approximate projection, and quantify the resulting discretization error relative to the continuous flow.

    Authors: We agree this clarification is needed. In the revised discretization section we will state explicitly that the implicit scheme is a proximal (minimizing-movement) step for the lifted divergence taken over the convex set of OT maps; because the minimization is carried out over that closed convex set, the update exactly preserves membership in the OT-map set with no approximate projection. We will also add a quantitative error bound showing that the discretization error relative to the continuous flow is O(Δt) in the appropriate metric (consistent with standard analysis of implicit Euler schemes on convex sets), which is already implicit in our existing convergence proof but will now be stated explicitly. revision: yes

  2. Referee: The neural-network parameterization is stated to be equivalent to natural gradient descent of the lifted divergence; the derivation of this equivalence (parameter-space section) should be expanded to show how the convexity constraint on the network is enforced at each step without violating the OT-map convexity requirement.

    Authors: We will expand the parameter-space section with a detailed derivation. The convexity-constrained network (built from input-convex layers) guarantees that every forward pass produces a convex map, hence an element of the OT-map set, for any parameter vector. The equivalence to natural gradient descent follows because the Riemannian metric induced by the network parameterization automatically respects this architectural constraint; each parameter update therefore corresponds to a tangent step that remains inside the convex set without requiring an extra projection step. We will include the explicit chain-rule computation linking the parameter-space natural gradient to the constrained flow in function space. revision: yes
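
The rebuttal's second point turns on the parameterization producing a convex potential for every parameter vector. A minimal sketch of such a convexity-constrained parameterization, in the spirit of input-convex networks but not the authors' architecture (all names and sizes below are assumed for illustration):

# Minimal convexity-constrained parameterization (assumed, not the paper's architecture):
# a one-hidden-layer convex potential f_theta whose input gradient T_theta = grad f_theta
# is by construction the gradient of a convex function, hence an admissible Monge-map
# candidate (Brenier). Nonnegativity of the outer weights is enforced by squaring.
import numpy as np

rng = np.random.default_rng(0)
d, h = 2, 16                                # input dimension, hidden width
A = rng.normal(size=(h, d))                 # inner affine layer, unconstrained
c = rng.normal(size=h)
v = rng.normal(size=h)                      # outer weights, used as v**2 >= 0
alpha = 0.1                                 # strength of the convex quadratic term, >= 0

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def potential(x):
    # f(x) = sum_j v_j^2 * softplus(A_j . x + c_j) + (alpha / 2) * ||x||^2, convex in x
    return np.sum(v**2 * softplus(A @ x + c)) + 0.5 * alpha * np.dot(x, x)

def transport_map(x):
    # T(x) = grad_x f(x) = sum_j v_j^2 * sigmoid(A_j . x + c_j) * A_j + alpha * x
    return (v**2 * sigmoid(A @ x + c)) @ A + alpha * x

x = rng.normal(size=d)
print("f(x) =", potential(x))
print("T(x) =", transport_map(x))

Whether the pullback metric induced by such a parameterization really makes every natural-gradient step stay inside the OT-map set, as the rebuttal claims, is exactly the derivation the referee asks to see expanded.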

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines a constrained gradient flow directly from a user-chosen divergence lifted to the space of transport maps, with the constraint set being the convex set of OT maps (defined independently via optimal transport). It proves long-time existence and convergence to the OT map using standard convexity assumptions on the divergence and log-concavity of the target measure; these are external conditions, not derived from the flow itself. The neural-network parameterization and its equivalence to natural gradient descent in parameter space are derived computational observations, not a reduction of the central existence/convergence claim to fitted inputs or self-citations. No step reduces by construction to its own inputs, and no load-bearing uniqueness theorem or ansatz is smuggled in via self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions from optimal transport and convex analysis rather than new free parameters or invented entities.

axioms (2)
  • domain assumption Target probability measure is log-concave
    Invoked in the abstract to guarantee convergence of the constrained flow to the OT map; the assumption is stated precisely after this ledger.
  • domain assumption Divergence satisfies standard convexity conditions
    Required for the existence of long-time solutions and convergence as stated.
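
For concreteness, the first axiom can be stated in its standard form (a textbook definition, not a quotation from the paper): a probability measure ϱ1 on ℝᵈ with a density is log-concave when

  \[
    \varrho_1(x) \;=\; e^{-V(x)}, \qquad V:\mathbb{R}^d \to \mathbb{R}\cup\{+\infty\} \ \text{convex},
  \]

so, for instance, any Gaussian 𝒩(m, Σ) qualifies, with V(x) = ½ (x − m)ᵀ Σ⁻¹ (x − m) + const.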

pith-pipeline@v0.9.0 · 5572 in / 1234 out tokens · 25401 ms · 2026-05-15T00:43:55.230881+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 5 internal anchors
