From Layers to Networks: Comparing Neural Representations via Diffusion Geometry

Atharva Khandait; Jan E. Gerken

arxiv: 2605.15901 · v1 · pith:Y24ZAXAKnew · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-20 19:58 UTC · model grok-4.3

classification 💻 cs.LG

keywords representational similarity matrices · Markov transition matrices · multi-scale geometry · layer fusion · neural network comparison · diffusion operators · centered kernel alignment

The pith

A broad class of similarity measures built on representational similarity matrices can be rewritten exactly in terms of row-stochastic Markov matrices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a wide range of measures for comparing neural representations, built on pairwise similarity matrices, possess an exact alternative expression using row-stochastic Markov matrices. These matrices encode random-walk transitions and thereby let the same measures examine representation geometry at different scales simply by raising the matrix to successive powers. The same rewrite further permits the matrices from several layers to be combined through alternating diffusion into one operator that reflects the network's overall sample geometry. A sympathetic reader would care because the change moves comparisons from isolated layers toward whole-network descriptions and supplies adjustable resolution without redesigning the underlying similarity function.

Core claim

The central claim is that a broad class of similarity measures based on representational similarity matrices admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices. This equivalence permits multi-scale variants of measures such as Centered Kernel Alignment and Distance Correlation that use the t-th power of the transition matrix to examine geometry at chosen diffusion scales. It also supports fused variants in which the Markov matrices of multiple layers are merged via alternating diffusion into a single operator that captures the network's joint sample geometry, thereby shifting the comparison task from layer-to-layer to network-to-network.

What carries the argument

The row-stochastic Markov matrix obtained directly from a representational similarity matrix, serving as the transition operator that encodes geometry for diffusion at adjustable scales.
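As a minimal sketch of this object, the snippet below builds a toy RSM and row-normalizes it into a transition matrix. It assumes the textbook diffusion-maps construction P = D^{-1} S that the rebuttal on this page names; the paper's own map may differ. The Gaussian kernel, the unit bandwidth, and the function names `gaussian_rsm` and `markov_from_rsm` are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_rsm(X, bandwidth=1.0):
    """Symmetric, strictly positive RSM from an (n_samples, n_features) array.

    Assumption: a Gaussian kernel; any nonnegative similarity would do here.
    """
    diff = X[:, None, :] - X[None, :, :]          # pairwise differences
    sq_dists = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def markov_from_rsm(S):
    """Row-normalize a nonnegative RSM: P = D^{-1} S, D = diag(row sums)."""
    S = np.asarray(S, dtype=float)
    if np.any(S < 0):
        raise ValueError("row normalization assumes nonnegative similarities")
    degrees = S.sum(axis=1)
    if np.any(degrees <= 0):
        raise ValueError("every sample needs positive total similarity")
    return S / degrees[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # toy "layer activations" for 6 samples
P = markov_from_rsm(gaussian_rsm(X))
```

Each row of `P` sums to one and can be read as the one-step transition probabilities of a random walk over the samples.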

If this is right

Existing similarity measures acquire multi-scale versions that examine representation geometry at chosen diffusion times by matrix powering.
Multiple layers can be fused into one operator that encodes their joint sample geometry for a single network-level comparison.
Comparisons move from matching individual layers to matching entire networks as unified objects.
The same operators apply unchanged to out-of-distribution data without requiring new similarity definitions.
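The first consequence above, multi-scale variants obtained by matrix powering, can be sketched as follows. This page does not say how the paper plugs P^t into CKA, so the score below is an assumption: the cosine between doubly centered matrices under the Frobenius inner product, which coincides with linear CKA when applied to symmetric Gram matrices. All function names are hypothetical.

```python
import numpy as np

def gaussian_rsm(X):
    # Assumed toy similarity: Gaussian kernel with unit bandwidth
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def row_normalize(S):
    return S / S.sum(axis=1, keepdims=True)

def centered(M):
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return H @ M @ H

def alignment(A, B):
    """Cosine between doubly centered matrices (Frobenius inner product)."""
    Ac, Bc = centered(A), centered(B)
    return np.sum(Ac * Bc) / (np.linalg.norm(Ac) * np.linalg.norm(Bc))

def multiscale_alignment(S1, S2, t):
    """Compare two RSMs through the t-th powers of their transition matrices."""
    P1_t = np.linalg.matrix_power(row_normalize(S1), t)
    P2_t = np.linalg.matrix_power(row_normalize(S2), t)
    return alignment(P1_t, P2_t)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
Z = rng.normal(size=(8, 5))                    # unrelated representation

# A rotation leaves the Gaussian RSM unchanged, so these are 1 up to round-off
scores_rotated = [multiscale_alignment(gaussian_rsm(X), gaussian_rsm(X @ Q), t)
                  for t in (1, 2, 4)]
scores_unrelated = [multiscale_alignment(gaussian_rsm(X), gaussian_rsm(Z), t)
                    for t in (1, 2, 4)]
```

For an irreducible, aperiodic chain, P^t approaches a rank-one matrix with identical rows as t grows, and its centered version tends to zero, so a score of this form becomes numerically unstable at very large t.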

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Markov reformulation could be checked on similarity measures outside the two explicitly extended in the work.
Fused operators might be used to track how geometric structure evolves with network depth in a single computation.
The approach suggests a route to comparing networks trained on different tasks by aligning their combined diffusion geometries rather than single-layer snapshots.

Load-bearing premise

Converting a representational similarity matrix into its row-stochastic Markov form preserves the geometric relations needed for meaningful multi-scale and cross-layer comparisons.

What would settle it

If the multi-scale or fused Markov-based measures produce systematically lower agreement with human or task-based judgments than the original similarity measures when tested on networks whose representational differences are independently known, the claim that the conversion preserves essential geometry would be refuted.

Original abstract

Diffusion geometry is a manifold learning framework that uses random walks defined by Markov transition matrices to characterize the geometry of a dataset at multiple scales. We use diffusion geometry for neural representations, incorporating tools from multi-view learning into this field for the first time. Our key technical observation is that a broad class of similarity measures based on representational similarity matrices (RSMs) admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices, opening the door to manipulations from diffusion geometry. As a first application, we develop multi-scale variants of Centered Kernel Alignment and Distance Correlation, which utilise the $t^{th}$ power of the underlying transition matrix to probe the data geometry at adjustable diffusion scales. Going further, we introduce variants of these measures which fuse the Markov matrices of several layers via alternating diffusion into a single operator that captures the network's joint sample geometry, allowing similarity to be computed across multiple layers and shifting the comparison from layer-to-layer to network-to-network. We perform extensive numerical experiments, evaluating our measures on the Representational Similarity (ReSi) benchmark comprising 14 architectures trained on 7 datasets across three different domains. Our methods achieve SoTA results in accuracy and output correlation for both language and vision tasks across different models. We furthermore show SoTA performance on an additional benchmark evaluating on out-of-distribution data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Paper recasts RSM similarities as Markov matrices to enable diffusion-based multi-scale analysis and layer fusion for network comparisons, with competitive benchmark results but an open question on what row normalization discards.

The main point is that they map a range of RSM-based similarities to row-stochastic Markov matrices, then use powers of those matrices for adjustable diffusion scales and alternating diffusion to fuse layers into a single network-level operator. This moves comparisons from layer pairs to whole networks in one step.

The experiments give it weight. They evaluate on the ReSi benchmark across 14 architectures, 7 datasets, and three domains, reporting state-of-the-art accuracy and output correlation on vision and language tasks plus solid out-of-distribution results. That shows the measures are usable in practice and not just theoretical tweaks. The closed-form rewrite and the alternating-diffusion fusion step are the clearest additions relative to prior RSM work.

The soft spot is the row-normalization that produces the Markov matrix. If row sums reflect density, prototype strength, or class-conditional concentration in the representation space, that step can remove geometry the diffusion is meant to examine at different scales. The abstract states the equivalence but does not show whether they tested for distortion or kept the degree information in some form. A reader would want to see the full derivation and any ablation that compares normalized versus unnormalized versions.

This paper is for people already working on representational similarity who want multi-scale or cross-layer tools. Someone comparing models across domains or looking for manifold-style extensions of CKA would get direct value from the methods and the benchmark numbers. It deserves a serious referee because the experiments are broad and the technical move is a straightforward, internally consistent extension of diffusion geometry.
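The row-normalization concern in the letter can be made concrete with a small worked example. Assuming the plain construction P = D^{-1} S (which may not be the paper's exact map), rescaling each row of S by its own positive factor leaves P unchanged, so any information carried only by per-row scale is invisible to P. The matrix entries and scale factors are made up for illustration.

```python
import numpy as np

def row_normalize(S):
    return S / S.sum(axis=1, keepdims=True)

S = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

# Hypothetical per-sample factors standing in for "density" or "strength"
scale = np.array([1.0, 10.0, 0.1])
S_scaled = scale[:, None] * S          # rescale each row independently

# Both matrices map to the same transition matrix
same_P = np.allclose(row_normalize(S), row_normalize(S_scaled))
```

Note that `S_scaled` is no longer symmetric, so this is a statement about the normalization map itself rather than about how two symmetric RSMs can differ.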

Referee Report

3 major / 2 minor

Summary. The paper claims that a broad class of similarity measures based on representational similarity matrices (RSMs) admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices. This enables multi-scale variants of Centered Kernel Alignment (CKA) and Distance Correlation using the t-th power of the transition matrix to probe geometry at adjustable scales, and layer fusion via alternating diffusion into a single operator for network-to-network comparisons. Experiments on the ReSi benchmark (14 architectures, 7 datasets, 3 domains) report state-of-the-art accuracy and output correlation for vision and language tasks, plus strong performance on out-of-distribution data.

Significance. If the equivalence derivation is exact and the row-stochastic normalization preserves task-relevant geometric structure, the framework would provide a principled way to extend RSM-based measures to multi-scale analysis and cross-network fusion, potentially improving robustness over single-scale CKA or DCor. The reported SoTA results on a broad benchmark suggest empirical utility if the claims are reproducible and the normalization artifacts are shown to be negligible.

major comments (3)

[§3] Derivation of equivalence: The central claim that RSM similarities admit an exact closed-form rewrite as a row-stochastic Markov matrix P (via row-normalization of S) must explicitly demonstrate that this step does not discard sample-degree or density information that encodes task-relevant geometry; otherwise the subsequent multi-scale (P^t) and alternating-diffusion constructions rest on an unverified assumption.
[§5] Experimental results: The SoTA claims on the ReSi benchmark and OOD evaluation are reported without error bars, statistical significance tests, or data-exclusion criteria, making it impossible to assess whether the reported gains over baselines are robust or driven by specific dataset/model subsets.
[§4.2] Alternating diffusion fusion: The construction of the fused operator from multiple layer Markov matrices requires a precise definition of the alternating diffusion procedure and a proof or empirical check that the resulting joint geometry remains faithful for cross-architecture comparison.

minor comments (2)

[Notation] Clarify notation for the transition matrix P and its exact relation to the original similarity matrix S in all equations and pseudocode.
[Methods] Add a brief discussion of computational complexity for the diffusion powers and alternating diffusion steps when applied to large representation matrices.
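On the complexity point, a rough sketch under stated assumptions: with dense n × n matrices and standard algorithms, one matrix product costs O(n^3), so P^t by repeated squaring needs O(log t) such products. When S is symmetric with positive row sums, an alternative is to factor the symmetric conjugate A = D^{-1/2} S D^{-1/2} once and reuse the eigendecomposition for every integer scale t, since P^t = D^{-1/2} A^t D^{1/2}. The function name `diffusion_power` and the toy matrix are hypothetical.

```python
import numpy as np

def diffusion_power(S, t):
    """P^t for P = D^{-1} S, via the symmetric conjugate A = D^{-1/2} S D^{-1/2}.

    Assumes S is symmetric with positive row sums and t is a nonnegative integer.
    """
    d = S.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    A = inv_sqrt_d[:, None] * S * inv_sqrt_d[None, :]
    eigvals, eigvecs = np.linalg.eigh(A)          # one O(n^3) factorization
    A_t = (eigvecs * eigvals ** t) @ eigvecs.T    # A^t = V diag(lambda^t) V^T
    # Undo the conjugation: P^t = D^{-1/2} A^t D^{1/2}
    return inv_sqrt_d[:, None] * A_t * np.sqrt(d)[None, :]

# Toy symmetric similarity: exp(-|i - j|) on 5 samples
idx = np.arange(5)
S = np.exp(-np.abs(np.subtract.outer(idx, idx)).astype(float))
P = S / S.sum(axis=1, keepdims=True)

P5_eig = diffusion_power(S, 5)
P5_direct = np.linalg.matrix_power(P, 5)
```

Both routes store dense n × n matrices, so memory grows as O(n^2) with the number of samples regardless of which one is used.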

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and planned revisions to improve the manuscript.

Referee: [§3] Derivation of equivalence: The central claim that RSM similarities admit an exact closed-form rewrite as a row-stochastic Markov matrix P (via row-normalization of S) must explicitly demonstrate that this step does not discard sample-degree or density information that encodes task-relevant geometry; otherwise the subsequent multi-scale (P^t) and alternating-diffusion constructions rest on an unverified assumption.

Authors: We agree that explicit verification is warranted. In the revision we will expand §3 with a derivation showing that row-normalization P = D^{-1}S (D the diagonal degree matrix) is the standard construction from diffusion maps, which incorporates local sample density rather than discarding it. The resulting transition probabilities preserve intrinsic geometry while correcting for non-uniform sampling; we will include a brief analytic argument and a low-dimensional example confirming that cluster structure relevant to downstream tasks is retained under powering.
revision: yes

Referee: [§5] Experimental results: The SoTA claims on the ReSi benchmark and OOD evaluation are reported without error bars, statistical significance tests, or data-exclusion criteria, making it impossible to assess whether the reported gains over baselines are robust or driven by specific dataset/model subsets.

Authors: The referee correctly identifies a presentational gap. We will recompute all ReSi and OOD results over multiple random seeds, report means with standard-deviation error bars, add paired statistical tests (t-tests or Wilcoxon signed-rank) against baselines, and state explicitly that the full set of 14 architectures, 7 datasets, and all OOD splits were retained without exclusion.
revision: yes

Referee: [§4.2] Alternating diffusion fusion: The construction of the fused operator from multiple layer Markov matrices requires a precise definition of the alternating diffusion procedure and a proof or empirical check that the resulting joint geometry remains faithful for cross-architecture comparison.

Authors: We will rewrite §4.2 to give an algorithmic definition of the alternating-diffusion fusion (iterated application of layer-specific transition matrices with normalization at each step). In addition we will insert a controlled empirical check on synthetic data with known joint geometry, demonstrating that the fused operator recovers cross-layer correspondences more accurately than single-layer baselines and introduces no systematic distortion for cross-architecture use.
revision: yes
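The fusion step described in the last response can be sketched as an ordered product of per-layer transition matrices, following the form F_AD = P(S_n) ... P(S_1) quoted elsewhere on this page. This is a sketch under assumptions: row-normalization stands in for the paper's map P(·), the Gaussian kernel is an illustrative similarity, and `fuse_layers` is a hypothetical name.

```python
from functools import reduce

import numpy as np

def gaussian_rsm(X):
    # Assumed toy similarity: Gaussian kernel with unit bandwidth
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def row_normalize(S):
    return S / S.sum(axis=1, keepdims=True)

def fuse_layers(rsms):
    """Ordered product P(S_n) ... P(S_1) of per-layer transition matrices."""
    transitions = [row_normalize(S) for S in rsms]
    # Left-multiply each new layer so the result reads P_n ... P_1
    return reduce(lambda fused, P: P @ fused, transitions)

rng = np.random.default_rng(0)
# Three toy "layers" of different width, evaluated on the same 6 samples
layers = [rng.normal(size=(6, width)) for width in (4, 8, 3)]
rsms = [gaussian_rsm(X) for X in layers]
fused = fuse_layers(rsms)
```

A product of row-stochastic matrices is itself row-stochastic, so in exact arithmetic no renormalization is needed after each step; the order of the factors does matter, because the per-layer matrices do not commute in general.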

Circularity Check

0 steps flagged

No circularity: equivalence is a direct mathematical rewriting, not a reduction by construction

The paper's central technical step is the observation that RSM-based similarities admit a closed-form rewrite using row-stochastic Markov matrices obtained via standard row-normalization. This is a definitional transformation (P = D^{-1}S or equivalent) that enables subsequent diffusion operators such as P^t and alternating diffusion; it does not derive a new quantity from fitted parameters or self-citations that then loops back to the original claim. The multi-scale CKA/DCor variants and network-to-network fusion are explicit extensions built on this rewrite, with the paper providing empirical validation on the ReSi benchmark rather than relying on tautological equivalence. No self-definitional loop, fitted-input prediction, or load-bearing self-citation is present in the derivation chain. The row-normalization step is a standard construction whose information-loss properties are a separate validity question, not evidence of circularity.
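The information-loss question flagged at the end of this rationale has one easily checked facet. For a symmetric, nonnegative S with positive row sums, the degree vector normalized to sum one is a stationary distribution of P = D^{-1} S, and for a connected similarity graph with self-loops the rows of P^t converge to it. The toy matrix below is made up for illustration, and the construction P = D^{-1} S is the one named on this page, not necessarily the paper's.

```python
import numpy as np

# Symmetric toy RSM; nonzero off-diagonals connect all four samples
S = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.2],
              [0.0, 0.1, 0.2, 1.0]])

d = S.sum(axis=1)          # degrees (row sums)
P = S / d[:, None]         # P = D^{-1} S
pi = d / d.sum()           # candidate stationary distribution

# pi P = pi holds because S is symmetric
is_stationary = np.allclose(pi @ P, pi)

# Long-run behaviour: every row of P^t approaches pi
P_long = np.linalg.matrix_power(P, 200)
rows_converge = np.allclose(P_long, np.tile(pi, (4, 1)), atol=1e-8)
```

So in this symmetric setting the degrees remain recoverable from P up to a global scale. Whether the paper's measures actually make use of that information is a separate question that this check does not address.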

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the mathematical equivalence between RSM-based similarities and row-stochastic Markov matrices plus the validity of diffusion powers and alternating diffusion for capturing multi-scale and joint geometry; no free parameters are explicitly fitted in the abstract description.

axioms (1)

domain assumption: Similarity measures based on representational similarity matrices admit a closed-form equivalent formulation in terms of row-stochastic Markov matrices.
This equivalence is the key technical observation that enables all subsequent diffusion-geometry manipulations.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag describes the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Theorem 4.1 ... P(S) = 1q^T + α(S) C_q(S) ... m_RSM(S1,S2) = ψ(C_q(P(S1)), C_q(P(S2)))

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: multi-scale variants ... P(S)^t ... alternating-diffusion fusion map F_AD(A) = P(S_n) ... P(S_1)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page arXiv 2025

[14] [14]

Position: The Platonic Representation Hypothesis

Minyoung Huh et al. “Position: The Platonic Representation Hypothesis”. In:Proceedings of the 41st International Conference on Machine Learning. 2024

work page 2024

[15] [15]

Representational Alignment Across Model Layers and Brain Regions with Multi- Level Optimal Transport

Shaan Shah and Meenakshi Khosla. “Representational Alignment Across Model Layers and Brain Regions with Multi- Level Optimal Transport”. In:International Conference on Learning Representations (ICLR). 2026

work page 2026

[16] [16]

Evaluating representational similarity measures from the lens of functional correspondence

Yiqing Bo et al. “Evaluating representational similarity measures from the lens of functional correspondence”. In:arXiv preprint arXiv:2411.14633(2024). arXiv:2411.14633

work page arXiv 2024

[17] [17]

Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps

Ronald R. Coifman et al. “Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps”. In:Proceedings of the National Academy of Sciences102.21 (2005), pp. 7426–7431

work page 2005

[18] [18]

Fractional diffusion maps

Harbir Antil, Tyrus Berry, and John Harlim. “Fractional diffusion maps”. In:Applied and Computational Harmonic Analysis54 (2021), pp. 145–175

work page 2021

[19] [19]

Alternating diffusion for common manifold learning with application to sleep stage assessment

Roy R. Lederman et al. “Alternating diffusion for common manifold learning with application to sleep stage assessment”. In:2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2015, pp. 5758–5762

work page 2015

[20] [20]

Latent common manifold learning with alternating diffusion: analysis and applications

Ronen Talmon and Hau-Tieng Wu. “Latent common manifold learning with alternating diffusion: Analysis and appli- cations”. In:Applied and Computational Harmonic Analysis47.3 (2019), pp. 848–892. arXiv:1602.00078

work page internal anchor Pith review Pith/arXiv arXiv 2019

[21] [21]

Alternating diffusion maps for multimodal data fusion

Ori Katz et al. “Alternating diffusion maps for multimodal data fusion”. In:Information Fusion45 (2019), pp. 346–360

work page 2019

[22] [22]

Audio-visual source separation with alternating diffusion maps

David Dov, Ronen Talmon, and Israel Cohen. “Audio-visual source separation with alternating diffusion maps”. In: Audio Source Separation. Ed. by Shoji Makino. Signals and Communication Technology. Springer, 2018

work page 2018

[23] [23]

Alternating Diffusion Map Based Fusion of Multimodal Brain Connectivity Networks for IQ Prediction

Li Xiao et al. “Alternating diffusion map based fusion of multimodal brain connectivity networks for IQ prediction”. In:IEEE Transactions on Biomedical Engineering66.8 (2019), pp. 2140–2151. arXiv:1810.12954

work page internal anchor Pith review Pith/arXiv arXiv 2019

[24] [24]

A Survey of Multi-View Representation Learning

Yingming Li, Ming Yang, and Zhongfei Zhang. “A survey of multi-view representation learning”. In:IEEE Transactions on Knowledge and Data Engineering31.10 (2019), pp. 1863–1883. arXiv:1610.01206

work page internal anchor Pith review Pith/arXiv arXiv 2019

[25] [25]

A survey of multi-view machine learning

Shiliang Sun. “A survey of multi-view machine learning”. In:Neural Computing and Applications23.7–8 (2013), pp. 2031–2038

work page 2013

[26] [26]

Visualizing the PHATE of neural networks

Scott Gigante et al. “Visualizing the PHATE of neural networks”. In:Advances in Neural Information Processing Systems. 2019. arXiv:1908.02831

work page arXiv 2019

[27] [27]

Visualizing structure and transitions in high-dimensional biological data

Kevin R. Moon et al. “Visualizing structure and transitions in high-dimensional biological data”. In:Nature Biotech- nology37.12 (2019), pp. 1482–1492. 12

work page 2019

[28] [28]

Elliott Abel et al.Exploring the manifold of neural networks using diffusion geometry. 2024. arXiv:2411.12626

work page arXiv 2024

[29] [29]

Neural FIM for learning Fisher information metrics from point cloud data

Oluwadamilola Fasina et al. “Neural FIM for learning Fisher information metrics from point cloud data”. In:Proceedings of the 40th International Conference on Machine Learning. 2023, pp. 9814–9826. arXiv:2306.06062

work page arXiv 2023

[30] [30]

Measuring statistical dependence with Hilbert-Schmidt norms

Arthur Gretton et al. “Measuring statistical dependence with Hilbert-Schmidt norms”. In:Algorithmic Learning Theory: 16th International Conference. 2005, pp. 63–77

work page 2005

[31] [31]

Representation topology divergence: A method for comparing neural network representa- tions

Serguei Barannikov et al. “Representation topology divergence: A method for comparing neural network representa- tions”. In:Proceedings of the 39th International Conference on Machine Learning. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022, pp. 1607–1626. arXiv:2201.00058

work page arXiv 2022

[32] [32]

David P ´erez-Fern´andez et al.Characterizing and measuring the similarity of neural networks with persistent homology

work page

[33] [33]

The Shape of Data: Intrinsic Distance for Data Distributions

Anton Tsitsulin et al. “The Shape of Data: Intrinsic Distance for Data Distributions”. In:International Conference on Learning Representations (ICLR). 2020

work page 2020

[34] [34]

Diachronic word embeddings reveal statistical laws of semantic change

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. “Diachronic word embeddings reveal statistical laws of semantic change”. In:Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016

work page 2016

[35] [35]

Generalized shape metrics on neural representations

Alex H. Williams et al. “Generalized shape metrics on neural representations”. In:Advances in Neural Information Processing Systems. 2021

work page 2021

[36] [36]

Convergent learning: Do different neural networks learn the same representations?

Yixuan Li et al. “Convergent learning: Do different neural networks learn the same representations?” In:Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015. 2015

work page 2015

[37] [37]

Cultural shift or linguistic drift? Comparing two computational measures of semantic change

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. “Cultural shift or linguistic drift? Comparing two computational measures of semantic change”. In:Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016

work page 2016

[38] [38]

Towards understanding the instability of network embedding

Chenxu Wang et al. “Towards understanding the instability of network embedding”. In:IEEE Transactions on Knowledge and Data Engineering34.2 (2022), pp. 927–941

work page 2022

[39] [39]

Understanding contrastive representation learning through alignment and uniformity on the hypersphere

Tongzhou Wang and Phillip Isola. “Understanding contrastive representation learning through alignment and uniformity on the hypersphere”. In:Proceedings of the 37th International Conference on Machine Learning. 2020

work page 2020

[40] [40]

On the downstream performance of compressed word embeddings

Avner May et al. “On the downstream performance of compressed word embeddings”. In:Advances in Neural Informa- tion Processing Systems. 2019

work page 2019

[41] [41]

GULP: a prediction-based metric between representations

Enric Boix-Adsera et al. “GULP: a prediction-based metric between representations”. In:Advances in Neural Information Processing Systems. 2022

work page 2022

[42] [42]

On the dimensionality of word embedding

Zi Yin and Yuanyuan Shen. “On the dimensionality of word embedding”. In:Advances in Neural Information Processing Systems. 2018

work page 2018

[43] [43]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In:Proceed- ings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019

work page 2019

[44] [44]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Loubna Ben Allal et al. “SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model”. In: arXiv preprint arXiv:2502.02737(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

Richard Socher et al. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. In:Pro- ceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2013

work page 2013

[46] [46]

Deep Residual Learning for Image Recognition

Kaiming He et al. “Deep Residual Learning for Image Recognition”. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016

work page 2016

[47] [47]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: International Conference on Learning Representations (ICLR). 2015

work page 2015

[48] [48]

ImageNet: A Large-Scale Hierarchical Image Database

Jia Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2009

work page 2009

[49] [49]

Stress test evaluation for natural language inference

Aakanksha Naik et al. “Stress test evaluation for natural language inference”. In:Proceedings of the 27th International Conference on Computational Linguistics. 2018

work page 2018

[50] [50]

BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance

R. Thomas McCoy, Junghyun Min, and Tal Linzen. “BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance”. In:Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Ed. by Afra Alishahi et al. Online: Association for Computational Lingui...

work page doi:10.18653/v1/2020.blackboxnlp-1.21 2020