metasnf: Meta Clustering with Similarity Network Fusion in R

Adam Taback; Ana Patricia Balbon; Anne L Wheeler; Bo Wang; Brian Cox; Colin Hawco; Daniel Felsky; Denise Sabac; Lauren Erdman; Maria T Secara

arxiv: 2410.17976 · v2 · submitted 2024-10-23 · 📊 stat.CO · cs.LG


Prashanth S Velayudhan, Xiaoqiao Xu, Prajkta Kallurkar, Ana Patricia Balbon, Maria T Secara, Adam Taback, Denise Sabac, Nicholas Chan, Shihao Ma, Bo Wang, Daniel Felsky, Stephanie H Ameis, Brian Cox, Colin Hawco, Lauren Erdman, Anne L Wheeler


Pith reviewed 2026-05-23 18:59 UTC · model grok-4.3

classification 📊 stat.CO cs.LG

keywords meta clustering · similarity network fusion · SNF · R package · cluster analysis · multi-modal data integration · subtype discovery


The pith

The metasnf R package applies meta clustering to SNF solutions so users can select clusters by context-specific usefulness instead of standard quality metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an R package that performs meta clustering on the outputs of similarity network fusion workflows. Meta clustering treats many different cluster solutions as objects to be clustered, revealing groups of similar solutions. This lets researchers choose among SNF-derived clusters according to criteria that matter in their specific biomedical or other setting. The package supplies supporting functions for visualizing, characterizing, and validating the resulting clusters. Its central goal is to move cluster selection away from generic internal metrics toward utility defined by the user's context.

Core claim

Meta clustering of SNF cluster solutions surfaces groupings that align with context-specific utility criteria rather than context-agnostic measures of cluster quality.

What carries the argument

Meta clustering applied to a collection of SNF-derived cluster solutions, where the solutions themselves become the input data for a second round of clustering.
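
As a rough illustration of this second round of clustering, the sketch below uses base R only. It is not the metasnf implementation: the toy `solutions` matrix, the helper `adjusted_rand_index()`, and the choice of average-linkage hierarchical clustering on 1 - ARI are all assumptions made for this example. The adjusted Rand index is used as the similarity between solutions because the paper's figures show heatmaps of adjusted Rand indices partitioned into meta clusters.

```r
# Adjusted Rand index between two label vectors (hypothetical helper, base R only).
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  # Degenerate case: both partitions are trivial, so treat them as identical.
  if (max_index == expected) return(1)
  (sum_cells - expected) / (max_index - expected)
}

# Toy input: each row is one cluster solution over the same 8 observations.
solutions <- rbind(
  s1 = c(1, 1, 1, 2, 2, 2, 3, 3),
  s2 = c(2, 2, 2, 1, 1, 1, 3, 3),  # same partition as s1, relabelled
  s3 = c(1, 1, 1, 2, 2, 2, 3, 2),  # s1 with one observation moved
  s4 = c(1, 2, 1, 2, 1, 2, 1, 2),
  s5 = c(2, 1, 2, 1, 2, 1, 2, 1),  # same partition as s4, relabelled
  s6 = c(1, 2, 1, 2, 1, 2, 2, 2)   # s4 with one observation moved
)

# Pairwise ARI between solutions: the solutions are now the objects being compared.
n_sol <- nrow(solutions)
ari <- matrix(1, n_sol, n_sol,
              dimnames = list(rownames(solutions), rownames(solutions)))
for (i in seq_len(n_sol - 1)) {
  for (j in (i + 1):n_sol) {
    ari[i, j] <- ari[j, i] <- adjusted_rand_index(solutions[i, ], solutions[j, ])
  }
}

# Second round of clustering: treat 1 - ARI as a dissimilarity between solutions.
meta_tree <- hclust(as.dist(1 - ari), method = "average")
meta_cluster <- cutree(meta_tree, k = 2)

print(round(ari, 2))
print(meta_cluster)
```

In this toy input, s1 to s3 are near-identical partitions and so are s4 to s6, so cutting the tree at two groups separates those two families of solutions. A user would then inspect one representative per meta cluster instead of every solution.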

If this is right

Users can search a wider range of SNF cluster solutions without manual inspection of each one.
Cluster selection can incorporate external or domain-specific criteria instead of relying solely on silhouette scores or similar measures.
The same meta-clustering step can be combined with the package's visualization and validation tools to inspect the chosen groupings.
SNF-based subtype discovery workflows gain an additional layer that organizes solutions by similarity before final selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on non-biomedical multi-modal datasets to check whether the utility advantage holds outside the paper's primary domain.
If meta-clusters correspond to distinct biological mechanisms, downstream analyses might focus on one meta-cluster at a time rather than on individual solutions.
The method assumes that proximity in the space of cluster solutions correlates with similarity in practical usefulness, which could be checked by comparing meta-cluster membership against independent utility labels.
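
The last check in the list above could be sketched as follows. This is only an illustration under stated assumptions: the `meta_cluster` labels and `utility` scores are invented toy values, one per cluster solution, and the Kruskal-Wallis test is one possible choice of comparison rather than anything the paper prescribes.

```r
# Hypothetical inputs: one meta-cluster label and one independent utility
# score per cluster solution. All numbers are made up for illustration.
meta_cluster <- factor(rep(c("A", "B", "C"), each = 4))
utility <- c(0.90, 0.80, 0.85, 0.95,
             0.50, 0.40, 0.45, 0.55,
             0.10, 0.20, 0.15, 0.05)

# Average utility of the solutions inside each meta cluster.
mean_utility <- tapply(utility, meta_cluster, mean)

# Rank-based test of whether utility differs across meta clusters.
kw <- kruskal.test(utility, meta_cluster)

print(mean_utility)
print(kw$p.value)
```

If utility did not differ across meta clusters on real data, that would count against the assumption that proximity between solutions tracks practical usefulness.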

Load-bearing premise

Clustering the cluster solutions themselves will reliably produce groups that are more useful under context-specific criteria than solutions chosen by standard quality metrics.

What would settle it

On a dataset with an independently measured context-specific utility score, the meta-clustered solutions show no higher average utility than solutions chosen by conventional internal validation metrics.

Figures

Figures reproduced from arXiv: 2410.17976 by Adam Taback, Ana Patricia Balbon, Anne L Wheeler, Bo Wang, Brian Cox, Colin Hawco, Daniel Felsky, Denise Sabac, Lauren Erdman, Maria T Secara, Nicholas Chan, Prajkta Kallurkar, Prashanth S Velayudhan, Shihao Ma, Stephanie H Ameis, Xiaoqiao Xu.

**Figure 2.** Heatmap of adjusted Rand indices, partitioned into meta clusters A-E.

**Figure 3.** Annotated heatmap of adjusted Rand indices.

**Figure 4.** Annotated heatmap of adjusted Rand indices with a manually generated annotation

**Figure 5.** Manhattan plot showing separation of all features for representative cluster solutions

**Figure 6.** Manhattan plot showing separation of all features for representative cluster solutions

**Figure 7.** Heatmap of p-values of associations between features in target list and metasnf

**Figure 8.** Side-by-side heatmaps of the settings matrix and corresponding adjusted Rand

**Figure 9.** ARI values calculated between cluster solutions of subsampled data.

**Figure 10.** Density plot showing the distribution of average co-clustering across all subsamples

**Figure 11.** Annotated heatmap showing how often all pairs of observations clustered together

**Figure 12.** Example plots generated by auto plotting functionality. a) A jitter plot showing the

**Figure 13.** Heatmap of p-values of pairwise associations between input features in a provided

**Figure 14.** Heatmap of p-values of pairwise associations between input features in a provided

**Figure 15.** Alluvial plot showing the distribution of observations across different numbered

**Figure 16.** Direct plot of the similarity matrix generated

**Figure 17.** Example plot generated by esm_manhattan(). Rows of the extended solutions matrix are provided as inputs and feature separation for each solution is plotted on the same set of axes. Solutions are separated by colour. The black and red horizontal lines are placed at the Bonferroni and unadjusted equivalents of p = 0.05 respectively

**Figure 18.** Example plot generated by var_manhattan(). One primary feature is specified and the association of that feature with all other features in a provided data list are plotted. The red horizontal line is placed at the unadjusted equivalent of p = 0.05.

**Figure 19.** Runtime scaling plots showing (A) relationships between time to generate cluster

Original abstract

metasnf is an R package that enables users to apply meta clustering, a method for efficiently searching a broad space of cluster solutions by clustering the solutions themselves, to clustering workflows based on similarity network fusion (SNF). SNF is a multi-modal data integration algorithm commonly used for biomedical subtype discovery. The package also contains functions to assist with cluster visualization, characterization, and validation. This package can help researchers identify SNF-derived cluster solutions that are guided by context-specific utility over context-agnostic measures of quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a software announcement for an R package that applies meta-clustering to SNF workflows, with no benchmarks or new methods.


The paper introduces metasnf, an R package that lets users run meta-clustering on the outputs of similarity network fusion runs and bundles in some visualization and validation helpers. The central feature is the ability to cluster cluster solutions themselves so that users can search a wider space guided by their own context-specific criteria rather than generic quality scores. That is the actual new piece: a packaged implementation targeted at SNF users in biomedical work. The package appears to do the straightforward job of making that workflow available in R without requiring users to code the meta-clustering step themselves. The supporting functions for characterization and plotting are the kind of practical additions that can save time for people already running SNF pipelines.

The main limitation is that the manuscript is only a description of the package. There are no benchmarks against other cluster-selection approaches, no worked examples on real data showing improved downstream utility, and no analysis of how often the meta-clustered solutions differ meaningfully from those picked by standard metrics. The claim that this approach surfaces more useful groupings therefore rests on the untested premise that meta-clustering will reliably do what the authors expect. The paper does not contradict itself or invent new entities, but it also does not supply evidence that would let a reader judge whether the tool delivers on its stated goal.

This is aimed at computational biologists and statisticians who already use SNF and want a ready-made way to explore multiple clusterings. A methods reader looking for algorithmic advances or empirical validation will find little. I would send it for peer review at a software or tools journal where the code and usability can be checked, but it does not look like a candidate for a general research methods venue without added validation work.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the metasnf R package, which implements meta-clustering on SNF-derived cluster solutions to enable efficient search over a broad space of clustering outcomes from multi-modal data integration. It also provides supporting functions for cluster visualization, characterization, and validation. The central claim is that this approach allows identification of SNF cluster solutions guided by context-specific utility rather than standard, context-agnostic quality metrics.

Significance. If the package implements the described functionality correctly, it supplies a practical R tool for biomedical researchers using SNF for subtype discovery, addressing the common issue of selecting among many possible cluster solutions. The work is a software contribution rather than a methodological advance or empirical demonstration; no machine-checked proofs, reproducible benchmarks, or falsifiable predictions are included.

major comments (2)

[Abstract] The assertion that the package 'can help researchers identify SNF-derived cluster solutions that are guided by context-specific utility over context-agnostic measures of quality' is presented as a capability without any accompanying code examples, simulated or real-data demonstrations, or comparisons showing that meta-clustering yields solutions preferred under context-specific criteria.
The manuscript supplies no validation results, error analysis, or benchmarks against existing SNF or clustering packages, leaving the practical utility of the meta-clustering workflow unsupported by evidence.

minor comments (1)

The manuscript would be strengthened by including at least one worked example (e.g., code and output) illustrating the meta-clustering workflow on a small dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review of our manuscript on the metasnf R package. We address the major comments below, noting that this is a software description paper focused on providing practical tools for SNF workflows rather than a methodological contribution with new empirical claims.


Referee: [Abstract] The assertion that the package 'can help researchers identify SNF-derived cluster solutions that are guided by context-specific utility over context-agnostic measures of quality' is presented as a capability without any accompanying code examples, simulated or real-data demonstrations, or comparisons showing that meta-clustering yields solutions preferred under context-specific criteria.

Authors: We agree that the abstract claim would be strengthened by explicit support. The package is designed to enable users to apply meta-clustering and then evaluate solutions using any context-specific criteria they define (e.g., alignment with external labels or domain knowledge). In the revised manuscript we will add a dedicated section with code examples and a simulated-data demonstration illustrating this workflow.
revision: yes

Referee: The manuscript supplies no validation results, error analysis, or benchmarks against existing SNF or clustering packages, leaving the practical utility of the meta-clustering workflow unsupported by evidence.

Authors: As a software contribution paper, the primary goal is to document the implemented functionality rather than to conduct comparative benchmarks. We acknowledge that the current manuscript does not include validation results or error analysis. To address the concern we will incorporate a short example workflow section that applies standard cluster validation metrics to meta-clustered solutions on simulated data.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; software description only


The manuscript is a package announcement describing the metasnf R package and its meta-clustering functionality for SNF workflows. It contains no equations, no derivation chain, no fitted parameters presented as predictions, and no load-bearing self-citations of mathematical results. The central claim is purely descriptive of what the software enables (searching cluster solutions via meta-clustering for context-specific utility), with no reduction of any asserted result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is a software packaging of existing clustering methods.

pith-pipeline@v0.9.0 · 5667 in / 993 out tokens · 21040 ms · 2026-05-23T18:59:01.775139+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "metasnf is an R package that enables users to apply meta clustering... to clustering workflows based on similarity network fusion (SNF)."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The meta clustering procedure proposed by Caruana et al. (2006) to address... disparities between context-agnostic metrics of cluster quality and context-specific usefulness"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1] summarize_dl: Return a summary of the data present in a data_list. R> summarize_dl(data_list) name type domain length width 1 subcortical_volume continuous neuroimaging 87 31 2 household_income continuous demographics 87 2 3 pubertal_status continuous demographics 87 2 R> summarize_dl(data_list, scope = "feature") name type domain 1 smri_vol_scs_cbwmatterl...

[2] collapse_dl: Convert a data_list to a single data frame. R> collapse_dl(data_list) R> class(collapse_dl(data_list)) "tbl_df" "tbl" "data.frame" C. Alternative formats for data list generation C.1. Named nested components Explicitly specifying each nested component during data list creation can improve code legibility. R> library("metasnf") R> heart_rate_...

[3] spectral_eigen: Number of clusters determined by eigen-gap heuristic

[4] spectral_rot: Number of clusters determined by rotation cost heuristic

[5] spectral_two: Yields a two cluster solution

[6] spectral_three: Yields a three cluster solution

[7] And so on, up to spectral_eight. A custom clustering algorithms list can be created by adding more clustering algorithms to sample from: R> clust_algs_list <- generate_clust_algs_list( + "two_cluster_spectral" = spectral_two, + "five_cluster_spectral" = spectral_five + ) R> summarize_clust_algs_list(clust_algs_list) alg_number algorithm 1 1 spectral_eigen ...

[8] The function takes a single N × N similarity (not distance) matrix as its only input

[9] The function returns a named list with two components: • The first item (named "solution") is a single N-dimensional vector of numbers corresponding to the observations in the similarity matrix • The second item (named "nclust") is a single integer indicating the number of clusters that the algorithm is supposed to have generated The function should not t...

[10] euclidean_distance Discrete distances:

[11] euclidean_distance Ordinal distances:

[14] gower_distance The distance metrics list is a named, nested list. The first layer of the list contains one element per feature type. Each of those elements contains a list of any number of distance calculating functions. The distance calculating functions themselves accept raw input data frames and a vector of feature weig...

[15] euclidean_distance (Euclidean distance):

[16] sn_euclidean_distance (Standardized and normalized Euclidean distance): • Standardizes and normalizes data prior to Euclidean distance calculation

[17] gower_distance (Gower's distance):

[18] siw_euclidean_distance (Squared, including weights, Euclidean distance) • Apply feature weights if provided to data frame, then calculates Euclidean distance, then squares the results

[19] sew_euclidean_distance (Squared, excluding weights, Euclidean distance) • Apply square root of feature weights to data frame, then calculates Euclidean distance, then squares the results

[20] hamming_distance (Hamming distance) Any of these functions can be accessed upon loading metasnf and can be formatted into a custom distance_metrics_list as follows: R> my_distance_metrics <- generate_distance_metrics_list( + continuous_distances = list( + "standard_norm_euclidean" = sn_euclidean_distance + ), + discrete_distances = list( + "standard_norm_e...

[21] standard_norm_euclidean Discrete distances:

[22] standard_norm_euclidean Ordinal distances:

[23] euclidean_distance Categorical distances:

[24] gower_distance Mixed distances:

[25] gower_distance To replace the default distance metrics rather than add on to them, the keep_defaults parameter can be set to FALSE during distance metrics list generation. In this case, users must ensure that at least one metric is provided for each of the 5 recognized feature types (continuous, discrete, ordinal, categorical, and mixed). This distance met...

[26] The first parameter, df, is a data frame that contains no UID column and contains an arbitrary number of feature columns

[27] The second parameter, weights_row, is a named vector of weights corresponding to the features in df. While it is necessary for the function to accept a weights_row, it is not necessary for the function to make use of this row. When writing code to apply weights to features within a function, one common approach is to convert the weights_row to a diagonal matr...
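
The extracts above describe two contracts for user-supplied functions: a clustering function that takes one N × N similarity matrix and returns a list with "solution" and "nclust", and a distance function that takes `df` and `weights_row`. The base R sketch below shows one way such functions might look. The names `hclust_two` and `weighted_euclidean` are hypothetical, the similarity-to-dissimilarity conversion is an assumption, and so is the return type of the distance function (a square matrix), since the extracted text is cut off before stating it. Nothing here calls metasnf itself.

```r
# Hypothetical custom clustering function following the quoted contract:
# input is a single N x N similarity matrix, output is a named list with
# "solution" (one label per observation) and "nclust" (a single integer).
hclust_two <- function(similarity_matrix) {
  # Assumption: convert similarity to dissimilarity by subtracting from the max.
  dissimilarity <- as.dist(max(similarity_matrix) - similarity_matrix)
  solution <- cutree(hclust(dissimilarity, method = "average"), k = 2)
  list(solution = unname(solution), nclust = 2L)
}

# Hypothetical custom distance function following the quoted contract:
# df holds feature columns only (no UID column) and weights_row is a named
# vector of feature weights, applied here through a diagonal matrix.
# Assumption: the result is returned as a square distance matrix.
weighted_euclidean <- function(df, weights_row) {
  x <- as.matrix(df)
  w <- diag(as.numeric(weights_row[colnames(x)]), nrow = ncol(x))
  as.matrix(dist(x %*% w, method = "euclidean"))
}

# Toy similarity matrix: observations 1-2 are similar, as are 3-4.
sim <- matrix(c(1.0, 0.9, 0.1, 0.2,
                0.9, 1.0, 0.2, 0.1,
                0.1, 0.2, 1.0, 0.8,
                0.2, 0.1, 0.8, 1.0), nrow = 4, byrow = TRUE)
clust <- hclust_two(sim)

# Toy feature table with two features; f2 is down-weighted by half.
toy_df <- data.frame(f1 = c(0, 3, 0), f2 = c(0, 0, 4))
toy_weights <- c(f1 = 1, f2 = 0.5)
d <- weighted_euclidean(toy_df, toy_weights)

print(clust)
print(d)
```

Going by the snippets in entries [7] and [20], functions like these would presumably be registered through `generate_clust_algs_list()` and `generate_distance_metrics_list()`, but the exact requirements should be checked against the package documentation.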
