pith. machine review for the scientific record.

arxiv: 2605.12161 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.CY · math.MG

Recognition: no theorem link

Fused Gromov-Wasserstein Distance with Feature Selection

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.CY · math.MG
keywords fused gromov-wasserstein · feature selection · optimal transport · graph alignment · regularization · alternating minimization · interpretability

The pith

Fused Gromov-Wasserstein distances with feature selection incorporate adaptive suppression weights to downweight irrelevant features during alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing fused Gromov-Wasserstein distances align objects by their structure and features but treat all features the same way. This paper modifies the objective to include adaptive weights that suppress the contribution of noisy or irrelevant features while solving for the alignment. It offers two concrete ways to do this: one adds Lasso or Ridge penalties on the weights, the other constrains the weights to a simplex and allows groupwise selection. The authors prove that these new distances remain bounded by the classical versions and continue to behave like metrics. They also give an alternating minimization procedure to compute them and demonstrate the approach on a redistricting task where it highlights meaningful features.

Core claim

The paper establishes that adaptive feature suppression weights can be integrated into the FGW objective, either via regularization penalties or simplex constraints, to enable selective feature downweighting during alignment. This yields distances that satisfy theoretical bounds with respect to standard FGW and GW distances and maintain metric properties. Computation proceeds via an efficient alternating minimization scheme that alternates between the coupling and the weights.
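As a concreteness check, here is a minimal sketch of what such an alternating scheme could look like for the Lasso mode, built on POT's fused_gromov_wasserstein solver. The objective below (feature cost Σ_r (1 − w_r) M_r with an ℓ1 penalty λ‖w‖₁ on suppression weights w ∈ [0,1]^d) is our reading of the setup, not the paper's verbatim formulation, and fs_fgw_lasso and per_feature_costs are hypothetical names.

```python
# Minimal sketch of the alternating scheme for the Lasso mode, assuming the
# objective (our reconstruction, not the paper's verbatim form):
#   min_{T, w in [0,1]^d}  (1 - alpha) * <sum_r (1 - w_r) M_r, T>
#                          + alpha * GW_term(C1, C2, T) + lam * ||w||_1
# fs_fgw_lasso and per_feature_costs are hypothetical helper names.
import numpy as np
import ot  # POT: Python Optimal Transport

def per_feature_costs(X1, X2):
    """M[r, i, j] = squared difference of feature r between node i and node j."""
    d = X1.shape[1]
    return np.stack([(X1[:, [r]] - X2[:, r][None, :]) ** 2 for r in range(d)])

def fs_fgw_lasso(C1, C2, X1, X2, p, q, alpha=0.5, lam=0.1, n_iter=10):
    M_r = per_feature_costs(X1, X2)            # (d, n1, n2) per-feature costs
    w = np.zeros(M_r.shape[0])                 # start with nothing suppressed
    for _ in range(n_iter):
        # T-step: standard FGW coupling under the current down-weighted cost.
        M = np.tensordot(1.0 - w, M_r, axes=1)
        T = ot.gromov.fused_gromov_wasserstein(M, C1, C2, p, q, alpha=alpha)
        # w-step: the objective is linear in w, so with an l1 penalty each
        # weight has a closed-form 0/1 update: suppress feature r exactly
        # when its transported cost exceeds the penalty lam.
        scores = np.array([np.sum(T * M_r[r]) for r in range(len(w))])
        w_new = ((1.0 - alpha) * scores > lam).astype(float)
        if np.allclose(w_new, w):              # weights stabilized: converged
            break
        w = w_new
    return T, w
```

The simplex-constrained mode would replace the w-step with a linear minimization over the simplex, whose solution sits at a vertex; that is consistent with Figure 9, where the simplex mode concentrates all suppression on the single highest-scoring feature.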

What carries the argument

Adaptive feature suppression weights optimized jointly with the transport coupling in the FGW objective.

If this is right

  • The modified distances satisfy explicit bounds relative to classical FGW and Gromov-Wasserstein distances.
  • Metric properties are preserved under the regularized and constrained formulations.
  • Feature suppression reveals task-relevant structure in applications such as computational redistricting.
  • An alternating minimization algorithm computes the distances efficiently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The groupwise extensions suggest the method can handle predefined feature groups without separate preprocessing.
  • Interpretability improvements may apply to other structured data comparison tasks that rely on optimal transport.
  • Joint optimization of weights and alignment could reduce sensitivity to feature scaling choices in practice.

Load-bearing premise

That jointly optimizing the alignment plan and the feature suppression weights will not introduce instability or break the metric properties in high-dimensional or correlated feature settings.

What would settle it

A numerical example with high-dimensional correlated features where the computed distance violates the triangle inequality or where the optimization yields degenerate zero weights for all features.
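A hedged numerical probe in that spirit, reusing the fs_fgw_lasso and per_feature_costs sketches above; the random inputs and all parameter choices here are ours, not the paper's.

```python
# Probe the load-bearing premise numerically: draw three random attributed
# graphs, compute pairwise fsFGW values with the sketch above, and check both
# the triangle inequality and all-or-nothing (degenerate) suppression weights.
import numpy as np
import ot

rng = np.random.default_rng(0)

def random_graph(n, d):
    X = rng.normal(size=(n, d))        # node features ...
    X[:, 1] = X[:, 0]                  # ... with a duplicated (correlated) column
    pos = rng.uniform(size=(n, 2))     # planted geometry for the structure term
    return ot.dist(pos, pos), X, ot.unif(n)

def fs_fgw_value(A, B, alpha=0.5, lam=0.1):
    (C1, X1, p), (C2, X2, q) = A, B
    _, w = fs_fgw_lasso(C1, C2, X1, X2, p, q, alpha=alpha, lam=lam)
    M = np.tensordot(1.0 - w, per_feature_costs(X1, X2), axes=1)
    # FGW value at the optimal coupling for the fixed weights, plus penalty.
    val = ot.gromov.fused_gromov_wasserstein2(M, C1, C2, p, q, alpha=alpha)
    return val + lam * w.sum(), w

graphs = [random_graph(30, 20) for _ in range(3)]
(d01, w01), (d12, _), (d02, _) = (fs_fgw_value(graphs[i], graphs[j])
                                  for i, j in [(0, 1), (1, 2), (0, 2)])
print("triangle slack (negative = violation):", d01 + d12 - d02)
print("degenerate weights (all or none suppressed)?", w01.all() or not w01.any())
```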

Figures

Figures reproduced from arXiv: 2605.12161 by Harlin Lee, Mingxin Li, Ranthony Clark, Ying Yu.

Figure 1. (a) Graphs matched by fsFGW-Ridge. (b) Feature scores.
Figure 2. fsFGW performance on toy graphs (d = 100, k = 10) is robust to δ and f choice.
Figure 3. Mean suppression weights w̄_r per feature on OGBG-MOLBACE, averaged over all graph pairs. {hybridization, is_aromatic, is_in_ring} are most strongly suppressed across both Lasso and Ridge. Next are {atomic_num, degree, num_Hs}, then {chirality, formal_charge}. num_radical_e is consistently retained.
Figure 4. Hierarchical clustering of plans with GW, FGW, and fsFGW (Lasso).
Figure 5. Mean suppression weights for fsFGW (Simplex) when comparing Plan 22 and Plan 22ct.
Figure 6. Suppression weights for fsFGW (Lasso) when comparing Plan 23 and Plan 25.
Figure 7. Convergence of fsFGW with f = 0.3 on two graphs (Graphs 20 and 26) from OGBG-MOLBACE. Top: FGW objective per iteration. Bottom: weight change ‖w^{k+1} − w^k‖. All modes converged in 2-8 iterations.
Figure 8. (a) The two independent random geometric graphs.
Figure 9. Suppression weights w_r recovered by each mode (n = 40, d = 10, k = 3, δ = 2.0). Differentiating features (red) should have w_r = 1; shared (blue) should have w_r = 0. Ridge (f = 0.2) gives continuous weights; Lasso (f = 0.3) gives binary weights recovering exactly the k = 3 differentiating features. Simplex concentrates all suppression on the single highest-scoring feature.
Figure 10. Visualization of matchings according to different transport plans.
Figure 11. Agreement between different transport plans T, calculated as the proportion of node mappings that agree between two plans.
Figure 12. Classification results in Table 1.
Figure 13. Clustering results in Table 1.
Figure 14. Mean suppression weights per feature on PROTEINS-FULL (d = 32) for Lasso (left) and Ridge (right) across suppression fractions f ∈ {0.3, 0.5, 0.7}. Error bars are omitted for readability.
Figure 15. North Carolina Congressional Redistricting Plans (2020-2025).
Figure 16. Hierarchical clustering of NC Congressional maps.
Figure 17. Top 10 mean suppression weights (± 1 standard deviation) for Plan 22 vs. Plan 22ct.
Figure 18. Weights for each district pair compared in Plan 23 vs. Plan 25.
Figure 19. fsFGW (Lasso) suppression weight heatmap for Plan 22 vs. Plan 22ct.
read the original abstract

Fused Gromov-Wasserstein (FGW) distances provide a principled framework for comparing objects by jointly aligning structure and node features. However, existing FGW formulations treat all features uniformly, which limits interpretability and robustness in high-dimensional settings where many features may be irrelevant or noisy. We introduce FGW distances with feature selection, which incorporate adaptive feature suppression weights into the FGW objective to selectively downweight or suppress differentiating features during alignment. We propose two approaches: (1) regularized FGW with Lasso and Ridge penalties, and (2) FGW with simplex-constrained weights, including groupwise extensions. We analyze the resulting models and establish their key theoretical properties, including bounds relative to classical FGW and Gromov-Wasserstein distances, and metric behavior. An efficient alternating minimization algorithm is developed. Experiments illustrate how feature suppression enhances interpretability and reveals task-relevant structure, with a special application to computational redistricting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces fused Gromov-Wasserstein distances with feature selection (fsFGW) by incorporating adaptive feature suppression weights into the FGW objective to downweight or suppress differentiating features during alignment. Two approaches are proposed: regularized FGW using Lasso and Ridge penalties, and FGW with simplex-constrained weights (including groupwise extensions). The authors analyze the models to establish theoretical properties, including bounds relative to classical FGW and Gromov-Wasserstein distances and metric behavior, develop an efficient alternating minimization algorithm, and present experiments showing enhanced interpretability, with an application to computational redistricting.

Significance. If the central claims on preserved metric properties and bounds hold under joint optimization of couplings and data-dependent feature weights, the work would meaningfully extend optimal transport tools for high-dimensional structured data by improving robustness to noisy or irrelevant features while adding interpretability, with potential impact on graph matching and alignment tasks.

major comments (2)
  1. [theoretical analysis section] The section establishing key theoretical properties asserts that the fsFGW distance satisfies metric axioms and provides bounds relative to classical FGW/GW, but the joint alternating minimization of the coupling and the feature suppression weights (Lasso/Ridge or simplex) means the effective weights can differ across the pairs in a triple. This risks violating the triangle inequality, and the manuscript does not provide an explicit proof or counterexample analysis showing when the property is inherited from fixed-weight FGW.
  2. [model formulation and analysis] In the formulation of the regularized and simplex-constrained objectives, the claim that the optimized distance remains a metric with the stated bounds assumes the suppression vector is stable, yet no analysis addresses degeneracy when features are linearly dependent or high-dimensional, where the simplex or Lasso solutions may collapse to trivial suppressions (e.g., all weight on one feature).
minor comments (2)
  1. [experiments] The experimental section would benefit from explicit baseline comparisons (standard FGW without feature selection) and details on hyperparameter selection for lambda and the regularization terms to allow reproduction of the interpretability gains.
  2. [notation and preliminaries] Notation for the feature weights could be unified or clearly distinguished between the Lasso/Ridge variant and the simplex variant to improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have identified important points for strengthening the theoretical analysis. We address each major comment below and indicate the corresponding revisions.

read point-by-point responses
  1. Referee: [theoretical analysis section] The section establishing key theoretical properties asserts that the fsFGW distance satisfies metric axioms and provides bounds relative to classical FGW/GW, but the joint alternating minimization of the coupling and the feature suppression weights (Lasso/Ridge or simplex) means the effective weights can differ across the pairs in a triple. This risks violating the triangle inequality, and the manuscript does not provide an explicit proof or counterexample analysis showing when the property is inherited from fixed-weight FGW.

    Authors: We appreciate the referee highlighting this subtlety regarding the triangle inequality. The current analysis provides bounds relative to classical FGW and GW distances and states metric properties, but the joint optimization indeed makes the effective feature weights pair-dependent, and an explicit proof accounting for this was not included. We acknowledge that this requires clarification, as the inheritance from fixed-weight FGW is not automatic. In the revised manuscript, we will expand the theoretical analysis section with either a rigorous proof demonstrating that the metric axioms (including the triangle inequality) continue to hold under the joint optimization, or a counterexample showing when the property fails, together with sufficient conditions for the bounds to be preserved. revision: yes

  2. Referee: [model formulation and analysis] In the formulation of the regularized and simplex-constrained objectives, the claim that the optimized distance remains a metric with the stated bounds assumes the suppression vector is stable, yet no analysis addresses degeneracy when features are linearly dependent or high-dimensional, where the simplex or Lasso solutions may collapse to trivial suppressions (e.g., all weight on one feature).

    Authors: The referee correctly notes that the manuscript does not analyze potential degeneracy of the suppression weights. In high-dimensional or linearly dependent feature regimes, the Lasso, Ridge, or simplex solutions can indeed collapse to trivial or unstable suppressions, which could affect the claimed metric properties and bounds. We will revise the model formulation and analysis section to include a dedicated discussion of these degeneracy cases. This will cover conditions under which collapse occurs, strategies for choosing regularization parameters to promote stability, and any adjustments to the theoretical bounds under mild feature independence assumptions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces FGW with feature selection by adding adaptive suppression weights to the existing FGW objective via two concrete formulations (Lasso/Ridge regularized and simplex-constrained, plus groupwise extensions). It then claims to derive bounds relative to classical FGW/GW and metric behavior through analysis of the new objectives. No step reduces a claimed prediction, bound, or property to a fitted input or self-definition by construction; the theoretical claims are presented as consequences of the modified objective rather than tautological renamings or self-citations. External references to prior FGW work are standard background and not load-bearing for the new results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the standard properties of FGW and optimal transport distances plus the assumption that feature weights can be optimized without breaking those properties.

free parameters (1)
  • regularization strength lambda
    Controls the Lasso or Ridge penalty on feature weights; must be chosen by the user or cross-validation.
axioms (1)
  • standard math Fused Gromov-Wasserstein distance is a valid metric on the space of attributed graphs
    Invoked when claiming the new weighted version remains a metric.
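Appendix fragments visible in the extraction indicate the paper in fact proves a relaxed form of this inheritance: a C_q-relaxed triangle inequality, which holds because the joint optimization over (T, w) reduces to a separable concave transformation of feature costs. Reconstructed in LaTeX from those fragments (our rendering of the constants as stated there):

```latex
% Relaxed triangle inequality for fsFGW, reconstructed from the paper's
% appendix fragments; the constant C_q depends only on the exponent q.
\[
  \mathrm{fsFGW}^{\star}(X,Z)
    \;\le\; C_q\,\bigl(\mathrm{fsFGW}^{\star}(X,Y) + \mathrm{fsFGW}^{\star}(Y,Z)\bigr),
  \qquad
  C_q =
  \begin{cases}
    1, & q = 1,\\[2pt]
    2^{\,q-1}, & q \ge 2.
  \end{cases}
\]
```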

pith-pipeline@v0.9.0 · 5465 in / 1176 out tokens · 68895 ms · 2026-05-13T05:30:28.777502+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

  1. Tara Abrishami, Nestor Guillen, Parker Rule, Zachary Schutzman, Justin Solomon, Thomas Weighill, and Si Wu. Geometry of graph partitions via optimal transport. SIAM Journal on Scientific Computing, 42(5):A3340–A3366, 2020.
  2. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
  3. Amartya Banerjee, Harlin Lee, Nir Sharon, and Caroline Moosmüller. Efficient trajectory inference in Wasserstein space using consecutive averaging. In International Conference on Artificial Intelligence and Statistics, pages 2260–2268. PMLR, 2025.
  4. Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S. V. N. Vishwanathan, Alex J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(1):47–56, January 2005.
  5. Luc Brogat-Motte, Rémi Flamary, Céline Brouard, Juho Rousu, and Florence D'Alché-Buc. Learning to predict graphs with fused Gromov-Wasserstein barycenters. In International Conference on Machine Learning, pages 2321–2335. PMLR, 2022.
  6. Junyu Chen, Binh T. Nguyen, Shang H. Koh, and Yong S. Soh. Semidefinite relaxations of the Gromov-Wasserstein distance. Advances in Neural Information Processing Systems, 37:69814–69839, 2024.
  7. Ranthony A. Clark, Tom Needham, and Thomas Weighill. Generalized dimension reduction using semi-relaxed Gromov-Wasserstein distance. Proceedings of the AAAI Conference on Artificial Intelligence, 39(15):16082–16090, April 2025.
  8. Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 30, 2017.
  9. K. Didan. MOD13Q1 MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V006, 2015.
  10. Paul D. Dobson and Andrew J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.
  11. David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
  12. Moon Duchin, Olivia Walch, et al. Political Geometry: Rethinking Redistricting in the US with Math, Law, and Everything in Between, volume 3. Springer, 2022.
  13. C. D. Elvidge, M. Zhizhin, T. Ghosh, and F.-C. Hsu. Annual time series of global VIIRS nighttime lights derived from monthly averages: 2012 to 2019. Remote Sensing, 13(5):922, 2021.
  14. Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  15. Matthias Fey, Jinu Sunil, Akihiro Nitta, Rishi Puri, Manan Shah, Blaz Stojanovic, Ramona Bendias, Barghi Alexandria, Vid Kocijan, Zecheng Zhang, Xinwei He, Jan E. Lenssen, and Jure Leskovec. PyG 2.0: Scalable learning on real world graphs. In Temporal Graph Learning Workshop @ KDD, 2025.
  16. Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T. H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, ...
  17. Rémi Flamary, Cédric Vincent-Cuaz, Nicolas Courty, Alexandre Gramfort, Oleksii Kachaiev, Huy Quang Tran, Laurène David, Clément Bonet, Nathan Cassereau, Théo Gnassounou, Eloi Tanguy, Julie Delon, Antoine Collas, Sonia Mazelet, Laetitia Chapel, Tanguy Kerdoncuff, Xizheng Yu, Matthew Feickert, Paul Krzakala, Tianlin Liu, and Eduardo Fernandes Montesuma. POT: Python Optimal Transport (version 0.9.5), 2024.
  18. Gabriel Peyré and Marco Cuturi. Computational optimal transport with applications to data sciences. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  19. N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202:18–27, 2017.
  20. Arthur E. Hoerl and Robert W. Kennard. Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82, 1970.
  21. Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGB-LSC: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021.
  22. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133, 2020.
  23. M. Kummu, M. Taka, and J. H. A. Guillaume. Gridded global datasets for Gross Domestic Product and Human Development Index over 1990–2015. Scientific Data, 5:180004, 2018.
  24. Simon Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
  25. Jiajin Li, Jianheng Tang, Lemin Kong, Huikang Liu, Jia Li, Anthony Man-Cho So, and Jose H. Blanchet. A convergent single-loop algorithm for relaxation of Gromov-Wasserstein in graph data. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
  26. Xinyu Ma, Xu Chu, Yasha Wang, Yang Lin, Junfeng Zhao, Liantao Ma, and Wenwu Zhu. Fused Gromov-Wasserstein graph mixup for graph-level classifications. Advances in Neural Information Processing Systems, 36:15252–15276, 2023.
  27. Facundo Mémoli. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.
  28. Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020.
  29. National Cancer Institute Developmental Therapeutics Program. AIDS Antiviral Screen Data. http://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data, 2017. Accessed: 2017-09-27.
  30. Francesco Orsini, Paolo Frasconi, and Luc De Raedt. Graph invariant kernels. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 3756–3762. AAAI Press, 2015.
  31. Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pages 2664–2672. PMLR, 2016.
  32. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
  33. Titouan Vayer, Nicolas Courty, Romain Tavenard, Laetitia Chapel, and Rémi Flamary. Optimal transport for structured data with application on graphs. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6275–6284, 2019.
  34. U.S. Census Bureau. American Community Survey 5-year estimates, 2016–2020. https://www.census.gov/programs-surveys/acs, 2021.
  35. Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Fused Gromov-Wasserstein distance for structured objects. Algorithms, 13(9):212, 2020.
  36. Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
  37. Z. Wan, S. Hook, and G. Hulley. MOD11A1 MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1km SIN Grid V006, 2015.
  38. Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9:513–530, 2018.
  39. Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin Duke. Gromov-Wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning, pages 6932–6941. PMLR, 2019.
    + GW(T⋆ 2)) =C q (fsFGW⋆(X,Y) + fsFGW ⋆(Y,Z)). This gives the (relaxed) triangle inequality as claimed. This highlights the importance of Step 1: the triangle inequality holds because the joint optimization over (T,w) reduces to a separable concave transformation of feature costs. B Algorithm and Convergence Our fsFGW algorithm is summarized in Algorithm ...