pith. sign in

arxiv: 1906.08227 · v1 · pith:CNVLUA63new · submitted 2019-06-19 · 📊 stat.ML · cs.LG· stat.AP

Local Bures-Wasserstein Transport: A Practical and Fast Mapping Approximation

Pith reviewed 2026-05-25 20:08 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.AP
keywords optimal transporttransport mapWasserstein barycenterGaussian mixture modelBures-Wassersteinapproximationmachine learning
0
0 comments X

The pith

Matching Gaussian components and applying closed-form Bures-Wasserstein maps between pairs produces a fast approximate optimal transport map that generalizes to new points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes to approximate the optimal transport map by fitting Gaussian mixture models to source and target distributions, matching their components, and transporting each pair with the analytic Bures-Wasserstein formula. This construction sidesteps the scaling problems of kernel-based map learners when sample counts grow large. The resulting map evaluates on unseen points and also supplies a parametric form for the Wasserstein barycenter. Experiments on simulated and real data indicate an 80-fold reduction in running time together with recovery of the barycenter support using fewer components than prior methods.

Core claim

By decomposing each density into a Gaussian mixture, matching components across the pair of mixtures, and applying the closed-form Bures-Wasserstein transport to every matched pair, one obtains an approximate global transport map that generalizes out of sample and serves as a practical approximation to the Wasserstein barycenter.

What carries the argument

Local Bures-Wasserstein transport between matched pairs of Gaussian components drawn from the mixture models of the two densities.

If this is right

  • Overall running time drops by a factor of roughly 80 compared with kernel-based map estimation methods.
  • Fewer mixture components suffice to recover the support of the Wasserstein barycenter.
  • The learned map applies immediately to out-of-sample points without retraining.
  • The procedure integrates directly into standard machine-learning pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-matching idea could be tried with other families that admit closed-form transport maps.
  • The speedup may open optimal-transport computations to online or streaming settings where kernel methods become prohibitive.
  • Because the map is parametric, it could be combined with density-estimation routines that already use mixture models.

Load-bearing premise

That matching Gaussian components across the two densities and transporting each matched pair independently with the Bures-Wasserstein map produces a sufficiently accurate global transport.

What would settle it

A direct comparison on two non-Gaussian mixture distributions where the component-matched map produces large discrepancies in transported mass relative to a ground-truth optimal map computed at high cost.

Figures

Figures reproduced from arXiv: 1906.08227 by Andr\'es Hoyos-Idrobo.

Figure 1
Figure 1. Figure 1: Illustration of the working principle of the approximation of the barycen￾ter: For each group g (e.g., heart and clover), the algorithm receives a data matrix Xg . It computes a GMM for each group. It uses the means of these GMMs as an approximation of the geometry. Next, it matches these means.Finally, it computes the Bures-Wasserstein barycenter for each matched pair. Practical computation of the barycen… view at source ↗
Figure 2
Figure 2. Figure 2: Performance evaluation. The vertical dashed line denotes k = 20. L-BW requires fewer atoms to recover the support, and it is the fastest method. 5.1 Shape interpolation of clouds of points A straightforward application of Wasserstein barycenter is the interpolation of shapes, as it relies on performing optimal transportation over geometric domains. We vary the number of atoms k ∈ {2, . . . , 100} and evalu… view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of barycenters: The coordinates represent the Wasserstein simplex, which consists in all of their Wasserstein barycenters under varying weights λ ∈ ∆3 . Fig.3a displays the barycenters obtained by [27] after thresholding using Otsu’s method. Computation time. We measured the running time of various approaches, including both the training time and the transport time. Fig.2b gives computation t… view at source ↗
Figure 4
Figure 4. Figure 4: Computation time taken to reach a solution: Quality of the fit of a `2 penalized logistic regression as a func￾tion of the computation time for a fixed number of atoms. (top) Predictive accu￾racy and (bottom) fairness score on held￾out-data. In both datasets, L-BW obtains the best fairness score, and the same pre￾diction score than ME-Gaussian with less computation time to reach a stable so￾lution. The tim… view at source ↗
Figure 5
Figure 5. Figure 5: Impact of mixing distributions with L-BW on performance of various classifiers: The performance of classifiers af￾ter mixing sensitive groups relative to mean scores of the raw classifier (per dataset). The fairness score, the lower, the better (< 0). The AUC, the higher, the better (> 0). L-BW im￾proves the fairness of logistic regression while maintaining prediction accuracy. For Gradient Bosting, L-BW r… view at source ↗
Figure 7
Figure 7. Figure 7: Setting the number of com￾ponents: The Akaike information criterion (AIC) as a function of the number of atoms for each silhouette. k = 20 achieves the minimum AIC for all shapes. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of barycenters for different methods: The coordinates repre￾sent the Wasserstein simplex, which consists in all of their Wasserstein barycenters under varying weights λ ∈ ∆3 . (top) row corresponds to k = 10; (bottom) row corresponds to k = 100. 7.2 Statistical-parity fairness experiment 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate 0.0 0.2 0.4 0.6 0.8 1.0 True Positive Rate DP￾ [PITH_FULL_IMA… view at source ↗
Figure 10
Figure 10. Figure 10 [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 9
Figure 9. Figure 9: Support recovery on simulated data: Evaluation of the performance varying the number k of atoms, for barycenters obtained with different weights λ ∈ ∆3 . Coordinates in the Wasserstein simplex represent shapes: cat (λ = [1, 0, 0]), rabbit (λ = [0, 1, 0]), and tooth (λ = [0, 0, 1]). The horizontal dashed line denotes 85% of the agreement. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visualization of the first two principal components for different datasets: We can see that for most of the datasets the groups are well separated. Training a classifier on top of this representation will lead to leak of information about the protected groups. Analysis (LDA), Gradient Boosting Trees (100 trees), Random Forest (100 trees), and Multi Layer Perceptron (MLP, 100 hidden units and ReLu as activ… view at source ↗
Figure 12
Figure 12. Figure 12: Setting the number of components: The AIC as a function the number of Gaussian components for all datasets and sensitive conditions. For most of the datasets, the optimal number of components is different for each sensitive group. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Impact of mixing distributions with L-BW on performance of various conventional classifiers: The performance of classifiers after mixing groups relative to mean scores of the raw classifier. The fairness score, the lower, the better (< 0). The AUC, the higher, the better (> 0). Mixing groups improves the fairness for most methods while preserving the predictive accuracy. 18 [PITH_FULL_IMAGE:figures/full_… view at source ↗
read the original abstract

Optimal transport (OT)-based methods have a wide range of applications and have attracted a tremendous amount of attention in recent years. However, most of the computational approaches of OT do not learn the underlying transport map. Although some algorithms have been proposed to learn this map, they rely on kernel-based methods, which makes them prohibitively slow when the number of samples increases. Here, we propose a way to learn an approximate transport map and a parametric approximation of the Wasserstein barycenter. We build an approximated transport mapping by leveraging the closed-form of Gaussian (Bures-Wasserstein) transport; we compute local transport plans between matched pairs of the Gaussian components of each density. The learned map generalizes to out-of-sample examples. We provide experimental results on simulated and real data, comparing our proposed method with other mapping estimation algorithms. Preliminary experiments suggest that our proposed method is not only faster, with a factor 80 overall running time, but it also requires fewer components than state-of-the-art methods to recover the support of the barycenter. From a practical standpoint, it is straightforward to implement and can be used with a conventional machine learning pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Local Bures-Wasserstein Transport, an approximation to the optimal transport map obtained by fitting Gaussian mixture models to source and target measures, matching their components, and composing the closed-form Bures-Wasserstein maps between each matched pair. The resulting parametric map is claimed to generalize to out-of-sample points and to yield a practical approximation to the Wasserstein barycenter that requires fewer components than competing methods. Preliminary experiments on simulated and real data are said to demonstrate an 80-fold reduction in overall running time together with improved component efficiency.

Significance. If the reported speed and component-count advantages are confirmed by properly documented experiments, the method supplies a lightweight, pipeline-friendly alternative to kernel-based transport-map learners that scales to larger sample regimes while remaining straightforward to implement.

major comments (3)
  1. [Experiments] Experiments section: the abstract asserts an overall 80× running-time reduction and fewer components for barycenter support recovery, yet no tables, timing breakdowns, component counts, error bars, or dataset specifications appear; without these quantitative results the central empirical claim cannot be evaluated.
  2. [Method] Method (component-matching paragraph): the procedure used to pair Gaussian components across the two fitted mixtures is stated only at a high level; the cost function, assignment algorithm, and handling of unequal numbers of components are not specified, rendering the local-transport construction non-reproducible and its accuracy unassessable.
  3. [Experiments] Out-of-sample generalization claim: the paper states that the learned map generalizes, but no quantitative hold-out error, comparison against in-sample performance, or baseline map learners on unseen points is reported, which is load-bearing for the practical utility asserted in the abstract.
minor comments (2)
  1. [Method] Notation for the component-wise transport map is introduced without an explicit equation number; adding a displayed equation would improve clarity.
  2. [Introduction] The abstract mentions “state-of-the-art methods” for comparison but does not name them; a brief list in the introduction would help readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and commit to a revised manuscript that incorporates the requested clarifications and additional results.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the abstract asserts an overall 80× running-time reduction and fewer components for barycenter support recovery, yet no tables, timing breakdowns, component counts, error bars, or dataset specifications appear; without these quantitative results the central empirical claim cannot be evaluated.

    Authors: We agree that the current manuscript presents only preliminary experimental results without the detailed quantitative documentation needed to substantiate the claims. In the revision we will add tables reporting timing breakdowns, component counts, error bars, and full dataset specifications to allow proper evaluation of the reported speed and efficiency advantages. revision: yes

  2. Referee: [Method] Method (component-matching paragraph): the procedure used to pair Gaussian components across the two fitted mixtures is stated only at a high level; the cost function, assignment algorithm, and handling of unequal numbers of components are not specified, rendering the local-transport construction non-reproducible and its accuracy unassessable.

    Authors: The referee correctly identifies that the component-matching step is described at an insufficient level of detail. We will expand the relevant paragraph to specify the cost function (Bures-Wasserstein distance between fitted Gaussians), the assignment procedure (Hungarian algorithm), and the rule for unequal component counts (e.g., a greedy matching with a distance threshold for unmatched components). revision: yes

  3. Referee: [Experiments] Out-of-sample generalization claim: the paper states that the learned map generalizes, but no quantitative hold-out error, comparison against in-sample performance, or baseline map learners on unseen points is reported, which is load-bearing for the practical utility asserted in the abstract.

    Authors: We acknowledge that the manuscript currently asserts out-of-sample generalization without supporting quantitative evidence. The revised version will include dedicated hold-out experiments that report generalization error, in-sample versus out-of-sample comparisons, and direct comparisons against baseline map learners on unseen points. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper constructs an approximate transport map by matching Gaussian mixture components and applying the known closed-form Bures-Wasserstein map between Gaussians. This closed-form is external mathematical knowledge, not derived or fitted within the paper. The central claims are empirical (speed and component count on tested data) rather than a parameter-free derivation that reduces to its own inputs. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the provided abstract or description. The method is internally consistent for its stated goal of a practical approximation.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The method relies on the existence of a closed-form Bures-Wasserstein map between Gaussians (standard result) and on the ability to match mixture components across densities (domain modeling choice). No new entities are postulated. Number of mixture components is a free modeling choice whose selection affects the reported speed and accuracy.

free parameters (1)
  • number of Gaussian components
    Chosen to recover barycenter support; the abstract states the method requires fewer components than baselines, implying this hyper-parameter is tuned or selected per experiment.
axioms (2)
  • standard math Bures-Wasserstein distance admits a closed-form optimal transport map between two Gaussians
    Invoked to justify the local transport plans; this is a known result from prior literature on Gaussian OT.
  • domain assumption Gaussian mixture models can be matched component-wise to approximate the global transport
    Core modeling step of the proposed local map; not derived in the abstract.

pith-pipeline@v0.9.0 · 5734 in / 1505 out tokens · 22410 ms · 2026-05-25T20:08:47.452363+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

  1. [1]

    Wasserstein GAN

    M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan.arXiv preprint arXiv:1701.07875, 2017

  2. [2]

    F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. InICML. ACM, 2005

  3. [3]

    Bhatia, T

    R. Bhatia, T. Jain, and Y. Lim. On the bures–wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018

  4. [4]

    C. Blake. Uci repository of machine learning databases. http://www. ics. uci. edu/˜ mlearn/MLRepository. html, 1998

  5. [5]

    Bonneel, M

    N. Bonneel, M. Van De Panne, S. Paris, and W. Heidrich. Displacement interpolation using lagrangian mass transport. InACM Transactions on Graphics (TOG), volume 30. ACM, 2011

  6. [6]

    R. E. Burkard, M. Dell’Amico, and S. Martello.Assignment problems. Springer, 2009

  7. [7]

    Courty, R

    N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE TPAMI, 39, 2017

  8. [8]

    M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. InNeurIPS, 2013

  9. [9]

    Cuturi and A

    M. Cuturi and A. Doucet. Fast computation of wasserstein barycenters. InICML, 2014

  10. [10]

    Obtaining fairness using optimal transport theory

    E. del Barrio, F. Gamboa, P. Gordaliza, and J.-M. Loubes. Obtaining fairness using optimal transport theory.arXiv preprint arXiv:1806.03195, 2018

  11. [11]

    Flamary and N

    R. Flamary and N. Courty. Pot python optimal transport library, 2017. URLhttps://github. com/rflamary/POT

  12. [12]

    Flamary, M

    R. Flamary, M. Cuturi, N. Courty, and A. Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107, 2018

  13. [13]

    Friedman, T

    J. Friedman, T. Hastie, and R. Tibshirani.The elements of statistical learning, volume 1. Springer series in statistics New York, 2001. 11

  14. [14]

    Genevay, G

    A. Genevay, G. Peyre, and M. Cuturi. Learning generative models with sinkhorn divergences. In AISTATS, 2018

  15. [15]

    C. R. Givens, R. M. Shortt, et al. A class of wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31, 1984

  16. [16]

    Jones, T

    E. Jones, T. Oliphant, and P. Peterson.{SciPy}: Open source scientific tools for{Python}. 2014

  17. [17]

    Larson, S

    J. Larson, S. Mattu, L. Kirchner, and J. Angwin. How we analyzed the compas recidivism algorithm. ProPublica (5 2016), 9, 2016

  18. [18]

    Masarotto, V

    V. Masarotto, V. M. Panaretos, and Y. Zemel. Procrustes metrics on covariance operators and optimal transportation of gaussian processes.Sankhya A, 2018

  19. [19]

    J. Munkres. Algorithms for the assignment and transportation problems.Journal of the society for industrial and applied mathematics, 5, 1957

  20. [20]

    Olfat and A

    M. Olfat and A. Aswani. Spectral algorithms for computing fair support vector machines. In AISTATS, 2018

  21. [21]

    N. Otsu. A threshold selection method from gray-level histograms.IEEE transactions on systems, man, and cybernetics, 9, 1979

  22. [22]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.JMLR, 12, 2011

  23. [23]

    Perrot, N

    M. Perrot, N. Courty, R. Flamary, and A. Habrard. Mapping estimation for discrete optimal transport. In NeurIPS, 2016

  24. [24]

    Peyré, M

    G. Peyré, M. Cuturi, et al. Computational optimal transport.Foundations and TrendsR⃝ in Machine Learning, 11, 2019

  25. [25]

    Rabin, G

    J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. InInternational Conference on Scale Space and Variational Methods in Computer Vision. Springer, 2011

  26. [26]

    M. A. Schmitz, M. Heitz, N. Bonneel, F. Ngole, D. Coeurjolly, M. Cuturi, G. Peyré, and J.-L. Starck. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11, 2018

  27. [27]

    Solomon, F

    J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34, 2015

  28. [28]

    Takatsu et al

    A. Takatsu et al. Wasserstein geometry of gaussian measures.Osaka Journal of Mathematics, 48, 2011

  29. [29]

    Tolstikhin, O

    I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders.arXiv preprint arXiv:1711.01558, 2017

  30. [30]

    C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008

  31. [31]

    J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast discrete distribution clustering using wasserstein barycenter with sparse support.IEEE Transactions on Signal Processing, 65

  32. [32]

    M. B. Zafar, I. Valera, M. G. Rogriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. InAISTATS, 2017. 12

  33. [33]

    Zaharia, R

    M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, et al. Apache spark: a unified engine for big data processing. Communications of the ACM, 59, 2016. 7 Applications: Additional analysis 7.1 Shape interpolation of a cloud of points experiment Figure 6:Input shapes: Cat, rabbit, and tooth. Ea...