Fused Gromov-Wasserstein Distance with Feature Selection
Pith reviewed 2026-05-13 05:30 UTC · model grok-4.3
The pith
Fused Gromov-Wasserstein distances with feature selection incorporate adaptive suppression weights to downweight irrelevant features during alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that adaptive feature suppression weights can be integrated into the FGW objective, either via regularization penalties or simplex constraints, to enable selective feature downweighting during alignment. This yields distances that satisfy theoretical bounds with respect to standard FGW and GW distances and maintain metric properties. Computation proceeds via an efficient alternating minimization scheme that alternates between the coupling and the weights.
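One plausible rendering of this objective, sketched here for concreteness (the symbols below, including the per-feature costs d_k and the penalty Ω, are editorial assumptions rather than notation quoted from the paper): with marginals μ, ν, couplings T ∈ Π(μ, ν), structure matrices C^X, C^Y, suppression weights w over m features, and trade-off α ∈ [0, 1],

\[
\mathrm{fsFGW}(X,Y)\;=\;\min_{T\in\Pi(\mu,\nu)}\;\min_{w}\;(1-\alpha)\sum_{i,j}\Big(\sum_{k=1}^{m} w_k\, d_k(x_i,y_j)\Big) T_{ij}\;+\;\alpha\sum_{i,j,i',j'}\big|C^X_{ii'}-C^Y_{jj'}\big|^{q}\, T_{ij}\,T_{i'j'}\;+\;\lambda\,\Omega(w),
\]

where the simplex variant constrains w to the probability simplex (with Ω absent or applied groupwise), while the regularized variant drops the simplex constraint and takes Ω to be an ℓ1 (Lasso) or ℓ2 (Ridge) penalty with strength λ.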
What carries the argument
Adaptive feature suppression weights optimized jointly with the transport coupling in the FGW objective.
If this is right
- The modified distances satisfy explicit bounds relative to classical FGW and Gromov-Wasserstein distances.
- Metric properties are preserved under the regularized and constrained formulations.
- Feature suppression reveals task-relevant structure in applications such as computational redistricting.
- An alternating minimization algorithm computes the distances efficiently (a minimal sketch follows this list).
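A minimal sketch of such an alternating scheme, assuming the simplex-constrained variant with squared-Euclidean per-feature costs and an entropic (softmin) weight update; the function fgw_fs_simplex and the weight update are illustrative rather than the paper's exact algorithm, with the coupling step delegated to POT's fused_gromov_wasserstein:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def fgw_fs_simplex(C1, C2, X, Y, p, q, alpha=0.5, lam=0.1, n_iter=20):
    """Alternate between the FGW coupling and simplex feature weights."""
    d = X.shape[1]
    # Per-feature cost tensor: D[k, i, j] = (X[i, k] - Y[j, k]) ** 2
    D = np.stack([(X[:, k][:, None] - Y[:, k][None, :]) ** 2
                  for k in range(d)])
    w = np.full(d, 1.0 / d)  # start from uniform weights
    for _ in range(n_iter):
        # Coupling step: classical FGW with the current weighted feature cost.
        M = np.tensordot(w, D, axes=1)  # (n, m) weighted cost matrix
        T = ot.gromov.fused_gromov_wasserstein(
            M, C1, C2, p, q, loss_fun='square_loss', alpha=alpha)
        # Weight step: per-feature transported cost, then a softmin update
        # (closed form for an entropic penalty that keeps w on the simplex).
        c = np.einsum('kij,ij->k', D, T)
        w = np.exp(-(c - c.min()) / lam)  # shift for numerical stability
        w /= w.sum()
    return T, w
```

As lam shrinks, the softmin update concentrates all weight on the cheapest feature, which foreshadows the degeneracy concern raised in the referee report below.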
Where Pith is reading between the lines
- The groupwise extensions suggest the method can handle predefined feature groups without separate preprocessing.
- Interpretability improvements may apply to other structured data comparison tasks that rely on optimal transport.
- Joint optimization of weights and alignment could reduce sensitivity to feature scaling choices in practice.
Load-bearing premise
That jointly optimizing the alignment plan and the feature suppression weights will not introduce instability or break the metric properties in high-dimensional or correlated feature settings.
What would settle it
A numerical example with high-dimensional correlated features where the computed distance violates the triangle inequality or where the optimization yields degenerate zero weights for all features.
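A minimal numerical probe along these lines, assuming a pairwise distance callable dist (for instance a wrapper returning the optimized objective value from the fgw_fs_simplex sketch above; both names are illustrative):

```python
import numpy as np

def max_triangle_violation(dist, items):
    """Largest value of D[i,k] - D[i,j] - D[j,k] over all triples;
    a strictly positive maximum certifies a triangle-inequality
    violation for the supplied distance."""
    n = len(items)
    D = np.array([[dist(items[i], items[j]) for j in range(n)]
                  for i in range(n)])
    # gaps[i, j, k] = D[i, k] - D[i, j] - D[j, k]
    gaps = D[:, None, :] - D[:, :, None] - D[None, :, :]
    return float(gaps.max())
```

Scanning ensembles of high-dimensional, strongly correlated feature sets with this check, while also inspecting the optimized weights for one-hot or all-zero collapse, would settle the question in either direction.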
Original abstract
Fused Gromov-Wasserstein (FGW) distances provide a principled framework for comparing objects by jointly aligning structure and node features. However, existing FGW formulations treat all features uniformly, which limits interpretability and robustness in high-dimensional settings where many features may be irrelevant or noisy. We introduce FGW distances with feature selection, which incorporate adaptive feature suppression weights into the FGW objective to selectively downweight or suppress differentiating features during alignment. We propose two approaches: (1) regularized FGW with Lasso and Ridge penalties, and (2) FGW with simplex-constrained weights, including groupwise extensions. We analyze the resulting models and establish their key theoretical properties, including bounds relative to classical FGW and Gromov-Wasserstein distances, and metric behavior. An efficient alternating minimization algorithm is developed. Experiments illustrate how feature suppression enhances interpretability and reveals task-relevant structure, with a special application to computational redistricting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Fused Gromov-Wasserstein distances with feature selection (FGW-FS) by incorporating adaptive feature suppression weights into the FGW objective to downweight or suppress differentiating features during alignment. Two approaches are proposed: regularized FGW using Lasso and Ridge penalties, and FGW with simplex-constrained weights (including groupwise extensions). The authors analyze the models to establish theoretical properties, including bounds relative to classical FGW and Gromov-Wasserstein distances and metric behavior, develop an efficient alternating minimization algorithm, and present experiments showing enhanced interpretability with an application to computational redistricting.
Significance. If the central claims on preserved metric properties and bounds hold under joint optimization of couplings and data-dependent feature weights, the work would meaningfully extend optimal transport tools for high-dimensional structured data by improving robustness to noisy or irrelevant features while adding interpretability, with potential impact on graph matching and alignment tasks.
major comments (2)
- [theoretical analysis section] The section establishing key theoretical properties asserts that the FGW-FS distance satisfies metric axioms and provides bounds relative to classical FGW/GW, but the joint alternating minimization of the coupling and the feature suppression weights (Lasso/Ridge or simplex) means the effective weights can differ across pairs in a triple. This risks violating the triangle inequality, and the manuscript does not provide an explicit proof or counterexample analysis showing when the property is inherited from fixed-weight FGW (see the sketch after this list).
- [model formulation and analysis] In the formulation of the regularized and simplex-constrained objectives, the claim that the optimized distance remains a metric with the stated bounds assumes the suppression vector is stable, yet no analysis addresses degeneracy when features are linearly dependent or high-dimensional, where the simplex or Lasso solutions may collapse to trivial suppressions (e.g., all weight on one feature).
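To make the first objection concrete, write FGW_w for the fixed-weight distance and, schematically ignoring the penalty term, fsFGW = min_w FGW_w. If each fixed-weight FGW_w is a metric, then for every w

\[
\mathrm{fsFGW}(X,Z)\;\le\;\mathrm{FGW}_{w}(X,Z)\;\le\;\mathrm{FGW}_{w}(X,Y)+\mathrm{FGW}_{w}(Y,Z),
\]

so minimizing over a single shared w gives only fsFGW(X,Z) ≤ min_w [FGW_w(X,Y) + FGW_w(Y,Z)]. The desired right-hand side, however, lets each pair choose its own weights:

\[
\mathrm{fsFGW}(X,Y)+\mathrm{fsFGW}(Y,Z)\;=\;\min_{w_1}\mathrm{FGW}_{w_1}(X,Y)+\min_{w_2}\mathrm{FGW}_{w_2}(Y,Z)\;\le\;\min_{w}\big[\mathrm{FGW}_{w}(X,Y)+\mathrm{FGW}_{w}(Y,Z)\big].
\]

Both sides of the would-be triangle inequality are dominated by the same shared-weight bound, so the inheritance argument does not close without further structure (or a relaxed constant).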
minor comments (2)
- [experiments] The experimental section would benefit from explicit baseline comparisons (standard FGW without feature selection) and details on hyperparameter selection for lambda and the regularization terms to allow reproduction of the interpretability gains.
- [notation and preliminaries] Notation for the feature weights could be unified or clearly distinguished between the Lasso/Ridge variant and the simplex variant to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have identified important points for strengthening the theoretical analysis. We address each major comment below and indicate the corresponding revisions.
Point-by-point responses
-
Referee: [theoretical analysis section] The section establishing key theoretical properties asserts that the FGW-FS distance satisfies metric axioms and provides bounds to classical FGW/GW, but the joint alternating minimization of the coupling and the feature suppression weights (Lasso/Ridge or simplex) means the effective weights can differ across pairs in a triple. This risks violating triangle inequality, and the manuscript does not provide an explicit proof or counterexample analysis showing when the property is inherited from fixed-weight FGW.
Authors: We appreciate the referee highlighting this subtlety regarding the triangle inequality. The current analysis provides bounds relative to classical FGW and GW distances and states metric properties, but the joint optimization indeed makes the effective feature weights pair-dependent, and an explicit proof accounting for this was not included. We acknowledge that this requires clarification, as the inheritance from fixed-weight FGW is not automatic. In the revised manuscript, we will expand the theoretical analysis section with either a rigorous proof demonstrating that the metric axioms (including triangle inequality) continue to hold under the joint optimization, or a counterexample showing when the property fails, together with sufficient conditions for the bounds to be preserved. revision: yes
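For reference, the paper's appendix does derive a relaxed triangle inequality; in sketch form, with the constants stated there,

\[
\mathrm{fsFGW}(X,Z)\;\le\;C_q\big(\mathrm{fsFGW}(X,Y)+\mathrm{fsFGW}(Y,Z)\big),\qquad C_q=\begin{cases}1, & q=1,\\ 2^{\,q-1}, & q\ge 2,\end{cases}
\]

with the proof resting on the joint optimization over (T, w) reducing to a separable, monotonically non-decreasing concave transformation of the feature costs. Making this relaxed form and its constants explicit in the main text would directly address the comment.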
-
Referee: [model formulation and analysis] In the formulation of the regularized and simplex-constrained objectives, the claim that the optimized distance remains a metric with the stated bounds assumes the suppression vector is stable, yet no analysis addresses degeneracy when features are linearly dependent or high-dimensional, where the simplex or Lasso solutions may collapse to trivial suppressions (e.g., all weight on one feature).
Authors: The referee correctly notes that the manuscript does not analyze potential degeneracy of the suppression weights. In high-dimensional or linearly dependent feature regimes, the Lasso, Ridge, or simplex solutions can indeed collapse to trivial or unstable suppressions, which could affect the claimed metric properties and bounds. We will revise the model formulation and analysis section to include a dedicated discussion of these degeneracy cases. This will cover conditions under which collapse occurs, strategies for choosing regularization parameters to promote stability, and any adjustments to the theoretical bounds under mild feature independence assumptions. revision: yes
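A toy illustration of the collapse mode in question, using the hypothetical entropic (softmin) weight update from the sketch earlier on this page, not the paper's exact update:

```python
import numpy as np

# Per-feature transported costs: features 0 and 1 are nearly equally
# informative (a 0.2 cost gap); features 2 and 3 are clearly worse.
c = np.array([1.0, 1.2, 5.0, 5.0])

for lam in [1.0, 0.1, 0.01]:
    w = np.exp(-c / lam)
    w /= w.sum()
    print(lam, np.round(w, 3))
# lam = 1.0  -> roughly [0.54, 0.44, 0.01, 0.01]: every feature keeps weight
# lam = 0.1  -> roughly [0.88, 0.12, 0.00, 0.00]
# lam = 0.01 -> essentially one-hot on feature 0: the near-duplicate
#               feature 1 is suppressed despite carrying almost the same
#               signal, which is the degenerate regime flagged above
```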
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces FGW with feature selection by adding adaptive suppression weights to the existing FGW objective via two concrete formulations (Lasso/Ridge regularized and simplex-constrained, plus groupwise extensions). It then claims to derive bounds relative to classical FGW/GW, together with metric behavior, through analysis of the new objectives. No step reduces a claimed prediction, bound, or property to a fitted input or self-definition by construction; the theoretical claims are presented as consequences of the modified objective rather than tautological renamings or self-citations. External references to prior FGW work are standard background and not load-bearing for the new results.
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization strength lambda
axioms (1)
- [standard math] Fused Gromov-Wasserstein distance is a valid metric on the space of attributed graphs
Reference graph
Works this paper leans on
-
[1]
Tara Abrishami, Nestor Guillen, Parker Rule, Zachary Schutzman, Justin Solomon, Thomas Weighill, and Si Wu. Geometry of graph partitions via optimal transport. SIAM Journal on Scientific Computing, 42(5):A3340–A3366, 2020.
-
[2]
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. Proceedings of Machine Learning Research, 2017.
-
[3]
Amartya Banerjee, Harlin Lee, Nir Sharon, and Caroline Moosmüller. Efficient trajectory inference in Wasserstein space using consecutive averaging. In International Conference on Artificial Intelligence and Statistics, pages 2260–2268. Proceedings of Machine Learning Research, 2025.
-
[4]
Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S. V. N. Vishwanathan, Alex J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(1):47–56, January 2005.
-
[5]
Luc Brogat-Motte, Rémi Flamary, Céline Brouard, Juho Rousu, and Florence D’Alché-Buc. Learning to predict graphs with fused Gromov-Wasserstein barycenters. In International Conference on Machine Learning, pages 2321–2335. Proceedings of Machine Learning Research, 2022.
-
[6]
Junyu Chen, Binh T Nguyen, Shang H Koh, and Yong S Soh. Semidefinite relaxations of the Gromov-Wasserstein distance. Advances in Neural Information Processing Systems, 37:69814–69839, 2024.
-
[7]
Ranthony A. Clark, Tom Needham, and Thomas Weighill. Generalized dimension reduction using semi-relaxed Gromov-Wasserstein distance. Proceedings of the AAAI Conference on Artificial Intelligence, 39(15):16082–16090, Apr. 2025.
-
[8]
Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 30, 2017.
-
[9]
K. Didan. MOD13Q1 MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V006, 2015.
-
[10]
Paul D. Dobson and Andrew J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.
-
[11]
David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
-
[12]
Moon Duchin, Olivia Walch, et al. Political geometry: rethinking redistricting in the US with math, law, and everything in between, volume 3. Springer, 2022.
-
[13]
C. D. Elvidge, M. Zhizhin, T. Ghosh, and F.-C. Hsu. Annual time series of global VIIRS nighttime lights derived from monthly averages: 2012 to 2019. Remote Sensing, 13(5):922, 2021.
-
[14]
Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
-
[15]
Matthias Fey, Jinu Sunil, Akihiro Nitta, Rishi Puri, Manan Shah, Blaz Stojanovic, Ramona Bendias, Barghi Alexandria, Vid Kocijan, Zecheng Zhang, Xinwei He, Jan E. Lenssen, and Jure Leskovec. PyG 2.0: Scalable learning on real world graphs. In Temporal Graph Learning Workshop @ KDD, 2025.
-
[16]
Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong,...
-
[17]
Rémi Flamary, Cédric Vincent-Cuaz, Nicolas Courty, Alexandre Gramfort, Oleksii Kachaiev, Huy Quang Tran, Laurène David, Clément Bonet, Nathan Cassereau, Théo Gnassounou, Eloi Tanguy, Julie Delon, Antoine Collas, Sonia Mazelet, Laetitia Chapel, Tanguy Kerdoncuff, Xizheng Yu, Matthew Feickert, Paul Krzakala, Tianlin Liu, and Eduardo Fernandes Montesuma. POT Python Optimal Transport (version 0.9.5), 2024.
-
[18]
Gabriel Peyré and Marco Cuturi. Computational optimal transport with applications to data sciences. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
-
[19]
N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202:18–27, 2017.
-
[20]
Arthur E Hoerl and Robert W Kennard. Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82, 1970.
-
[21]
Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGB-LSC: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021.
-
[22]
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133, 2020.
- [23]
-
[24]
Simon Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
-
[25]
Jiajin Li, Jianheng Tang, Lemin Kong, Huikang Liu, Jia Li, Anthony Man-Cho So, and Jose H. Blanchet. A convergent single-loop algorithm for relaxation of Gromov-Wasserstein in graph data. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
-
[26]
Xinyu Ma, Xu Chu, Yasha Wang, Yang Lin, Junfeng Zhao, Liantao Ma, and Wenwu Zhu. Fused Gromov-Wasserstein graph mixup for graph-level classifications. Advances in Neural Information Processing Systems, 36:15252–15276, 2023.
-
[27]
Facundo Mémoli. Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.
-
[28]
Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020.
-
[29]
National Cancer Institute Developmental Therapeutics Program. AIDS Antiviral Screen Data. http://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data, 2017. Accessed: 2017-09-27.
-
[30]
Francesco Orsini, Paolo Frasconi, and Luc De Raedt. Graph invariant kernels. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 3756–3762. AAAI Press, 2015.
-
[31]
Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pages 2664–2672. Proceedings of Machine Learning Research, 2016.
-
[32]
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
-
[33]
Titouan Vayer, Nicolas Courty, Romain Tavenard, Laetitia Chapel, and Rémi Flamary. Optimal transport for structured data with application on graphs. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6275–6284, 2019.
-
[34]
U.S. Census Bureau. American Community Survey 5-year estimates, 2016–2020. https://www.census.gov/programs-surveys/acs, 2021.
-
[35]
Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Fused Gromov-Wasserstein distance for structured objects. Algorithms, 13(9):212, 2020.
-
[36]
Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
-
[37]
Z. Wan, S. Hook, and G. Hulley. MOD11A1 MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1km SIN Grid V006, 2015.
-
[38]
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci., 9:513–530, 2018.
-
[39]
Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin Duke. Gromov-Wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning, pages 6932–6941. Proceedings of Machine Learning Research, 2019.