From Layers to Networks: Comparing Neural Representations via Diffusion Geometry
Pith reviewed 2026-05-20 19:58 UTC · model grok-4.3
The pith
Representational similarity matrices admit an exact rewrite as row-stochastic Markov matrices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a broad class of similarity measures based on representational similarity matrices admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices. This equivalence permits multi-scale variants of measures such as Centered Kernel Alignment and Distance Correlation that use the t-th power of the transition matrix to examine geometry at chosen diffusion scales. It also supports fused variants in which the Markov matrices of multiple layers are merged via alternating diffusion into a single operator that captures the network's joint sample geometry, thereby shifting the comparison task from layer-to-layer to network-to-network.
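The basic object in this claim can be sketched in a few lines. The example below is a minimal illustration, not the paper's implementation: it assumes the representational similarity matrix is a Gaussian kernel (the helper `gaussian_rsm` and its `bandwidth` parameter are hypothetical), row-normalizes it into P = D^{-1}S, and raises P to a diffusion scale t.

```python
import numpy as np


def gaussian_rsm(X, bandwidth=1.0):
    """Hypothetical RSM: Gaussian kernel on pairwise squared distances (an assumption)."""
    diffs = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diffs**2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def markov_from_rsm(S):
    """Row-normalize a nonnegative RSM: P = D^{-1} S, with D = diag(row sums of S)."""
    degrees = S.sum(axis=1, keepdims=True)
    return S / degrees


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # toy activations: 6 samples, 3 features
S = gaussian_rsm(X)
P = markov_from_rsm(S)               # each row is a probability distribution
P_t = np.linalg.matrix_power(P, 4)   # transition probabilities after t = 4 steps
```

Powers of a row-stochastic matrix stay row-stochastic, which is why P^t can still be read as a random walk at a coarser scale.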
What carries the argument
The row-stochastic Markov matrix obtained directly from a representational similarity matrix, serving as the transition operator that encodes geometry for diffusion at adjustable scales.
If this is right
- Existing similarity measures acquire multi-scale versions that examine representation geometry at chosen diffusion times by matrix powering.
- Multiple layers can be fused into one operator that encodes their joint sample geometry for a single network-level comparison.
- Comparisons move from matching individual layers to matching entire networks as unified objects.
- The same operators apply unchanged to out-of-distribution data without requiring new similarity definitions.
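As a rough illustration of the first consequence, the sketch below compares two toy representations at several diffusion times. It is an assumption-laden reading, not the paper's definition: it plugs the powered operators P^t into the usual CKA formula (uniform double-centering followed by a normalized Frobenius inner product), whereas the paper's own centering may differ. The names `gaussian_rsm`, `cka_style_score`, and `multiscale_score` are hypothetical.

```python
import numpy as np


def gaussian_rsm(X, bandwidth=1.0):
    """Hypothetical RSM: Gaussian kernel on pairwise squared distances (an assumption)."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.exp(-np.sum(diffs**2, axis=-1) / (2.0 * bandwidth**2))


def row_stochastic(S):
    """P = D^{-1} S for a nonnegative similarity matrix S."""
    return S / S.sum(axis=1, keepdims=True)


def center(M):
    """Double-center M with H = I - (1/n) 11^T, as in standard CKA."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ M @ H


def cka_style_score(A, B):
    """Normalized Frobenius inner product of the centered matrices."""
    Ac, Bc = center(A), center(B)
    return np.sum(Ac * Bc) / (np.linalg.norm(Ac) * np.linalg.norm(Bc))


def multiscale_score(S1, S2, t):
    """CKA-style score between the t-step operators of two RSMs (sketch only)."""
    P1 = np.linalg.matrix_power(row_stochastic(S1), t)
    P2 = np.linalg.matrix_power(row_stochastic(S2), t)
    return cka_style_score(P1, P2)


rng = np.random.default_rng(0)
X1 = rng.normal(size=(8, 3))                   # toy representation 1
X2 = X1 + 0.3 * rng.normal(size=(8, 3))        # perturbed copy as toy representation 2
S1, S2 = gaussian_rsm(X1), gaussian_rsm(X2)
scores = {t: multiscale_score(S1, S2, t) for t in (1, 2, 4)}
```

For very large t each P^t approaches a rank-one matrix with identical rows, which centering maps to zero, so this particular score becomes numerically unstable at long diffusion times; small to moderate t is the regime where the sketch is meaningful.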
Where Pith is reading between the lines
- The Markov reformulation could be checked on similarity measures outside the two explicitly extended in the work.
- Fused operators might be used to track how geometric structure evolves with network depth in a single computation.
- The approach suggests a route to comparing networks trained on different tasks by aligning their combined diffusion geometries rather than single-layer snapshots.
Load-bearing premise
Converting a representational similarity matrix into its row-stochastic Markov form preserves the geometric relations needed for meaningful multi-scale and cross-layer comparisons.
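One reason this premise is not automatic is that row-normalization is invariant to rescaling any row of the similarity matrix, so per-sample scale (degree) information does not survive in P on its own. The toy check below uses a made-up 3×3 matrix and arbitrary scale factors purely for illustration.

```python
import numpy as np


def row_stochastic(S):
    """P = D^{-1} S for a nonnegative similarity matrix S."""
    return S / S.sum(axis=1, keepdims=True)


# Toy similarity matrix (illustrative values only).
S = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

# Multiply row i by scales[i]; the row sums change by the same factors.
scales = np.array([1.0, 10.0, 0.01])
S_rescaled = scales[:, None] * S

P = row_stochastic(S)
P_rescaled = row_stochastic(S_rescaled)
same_operator = np.allclose(P, P_rescaled)   # the two Markov matrices coincide
```

The rescaled matrix is no longer symmetric, so it is not itself a typical similarity matrix; the simplest symmetric case of the same effect is a global rescaling c·S, which also leaves P unchanged. Whether the discarded scale matters for a given comparison is the empirical question the premise raises.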
What would settle it
If the multi-scale or fused Markov-based measures produce systematically lower agreement with human or task-based judgments than the original similarity measures when tested on networks whose representational differences are independently known, the claim that the conversion preserves essential geometry would be refuted.
Original abstract
Diffusion geometry is a manifold learning framework that uses random walks defined by Markov transition matrices to characterize the geometry of a dataset at multiple scales. We use diffusion geometry for neural representations, incorporating tools from multi-view learning into this field for the first time. Our key technical observation is that a broad class of similarity measures based on representational similarity matrices (RSMs) admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices, opening the door to manipulations from diffusion geometry. As a first application, we develop multi-scale variants of Centered Kernel Alignment and Distance Correlation, which utilise the $t^{th}$ power of the underlying transition matrix to probe the data geometry at adjustable diffusion scales. Going further, we introduce variants of these measures which fuse the Markov matrices of several layers via alternating diffusion into a single operator that captures the network's joint sample geometry, allowing similarity to be computed across multiple layers and shifting the comparison from layer-to-layer to network-to-network. We perform extensive numerical experiments, evaluating our measures on the Representational Similarity (ReSi) benchmark comprising 14 architectures trained on 7 datasets across three different domains. Our methods achieve SoTA results in accuracy and output correlation for both language and vision tasks across different models. We furthermore show SoTA performance on an additional benchmark evaluating on out-of-distribution data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a broad class of similarity measures based on representational similarity matrices (RSMs) admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices. This enables multi-scale variants of Centered Kernel Alignment (CKA) and Distance Correlation using the t-th power of the transition matrix to probe geometry at adjustable scales, and layer fusion via alternating diffusion into a single operator for network-to-network comparisons. Experiments on the ReSi benchmark (14 architectures, 7 datasets, 3 domains) report state-of-the-art accuracy and output correlation for vision and language tasks, plus strong performance on out-of-distribution data.
Significance. If the equivalence derivation is exact and the row-stochastic normalization preserves task-relevant geometric structure, the framework would provide a principled way to extend RSM-based measures to multi-scale analysis and cross-network fusion, potentially improving robustness over single-scale CKA or DCor. The reported SoTA results on a broad benchmark suggest empirical utility if the claims are reproducible and the normalization artifacts are shown to be negligible.
major comments (3)
- [§3] §3 (Derivation of equivalence): The central claim that RSM similarities admit an exact closed-form rewrite as a row-stochastic Markov matrix P (via row-normalization of S) must explicitly demonstrate that this step does not discard sample-degree or density information that encodes task-relevant geometry; otherwise the subsequent multi-scale (P^t) and alternating-diffusion constructions rest on an unverified assumption.
- [§5] §5 (Experimental results): The SoTA claims on the ReSi benchmark and OOD evaluation provide no error bars, statistical significance tests, or data-exclusion criteria, making it impossible to assess whether the reported gains over baselines are robust or driven by specific dataset/model subsets.
- [§4.2] §4.2 (Alternating diffusion fusion): The construction of the fused operator from multiple layer Markov matrices requires a precise definition of the alternating diffusion procedure and a proof or empirical check that the resulting joint geometry remains faithful for cross-architecture comparison.
minor comments (2)
- [Notation] Clarify notation for the transition matrix P and its exact relation to the original similarity matrix S in all equations and pseudocode.
- [Methods] Add a brief discussion of computational complexity for the diffusion powers and alternating diffusion steps when applied to large representation matrices.
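On the complexity point, a standard trick from the diffusion-maps literature applies when S is symmetric with positive row sums: P = D^{-1}S is similar to the symmetric matrix A = D^{-1/2} S D^{-1/2}, so one eigendecomposition of A yields every power of P. The sketch below is an illustration under that symmetry assumption and is not taken from the paper; `markov_power_via_eig` is a hypothetical name.

```python
import numpy as np


def markov_power_via_eig(S, t):
    """Compute (D^{-1} S)^t from one eigendecomposition of D^{-1/2} S D^{-1/2}.

    Assumes S is symmetric with strictly positive row sums.
    """
    d_sqrt = np.sqrt(S.sum(axis=1))
    A = S / np.outer(d_sqrt, d_sqrt)            # symmetric conjugate of P
    eigvals, eigvecs = np.linalg.eigh(A)        # O(n^3), done once
    A_t = (eigvecs * eigvals**t) @ eigvecs.T    # A^t = V diag(lambda^t) V^T
    # P^t = D^{-1/2} A^t D^{1/2}, i.e. entry (i, j) is scaled by sqrt(d_j / d_i).
    return A_t * (d_sqrt[None, :] / d_sqrt[:, None])


rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))                      # toy activations
diffs = X[:, None, :] - X[None, :, :]
S = np.exp(-np.sum(diffs**2, axis=-1) / 2.0)     # Gaussian kernel as a stand-in RSM

P = S / S.sum(axis=1, keepdims=True)
P5_direct = np.linalg.matrix_power(P, 5)
P5_eig = markov_power_via_eig(S, 5)
```

Both routes are cubic in the number of samples for dense matrices; the eigendecomposition route mainly helps when many scales t are needed, or when only the leading eigenpairs are kept as an approximation.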
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and planned revisions to improve the manuscript.
- Referee: [§3] §3 (Derivation of equivalence): The central claim that RSM similarities admit an exact closed-form rewrite as a row-stochastic Markov matrix P (via row-normalization of S) must explicitly demonstrate that this step does not discard sample-degree or density information that encodes task-relevant geometry; otherwise the subsequent multi-scale (P^t) and alternating-diffusion constructions rest on an unverified assumption.
  Authors: We agree that explicit verification is warranted. In the revision we will expand §3 with a derivation showing that the row-normalization P = D^{-1}S (with D the diagonal degree matrix) is the standard construction from diffusion maps, which incorporates local sample density rather than discarding it. The resulting transition probabilities preserve intrinsic geometry while correcting for non-uniform sampling; we will include a brief analytic argument and a low-dimensional example confirming that cluster structure relevant to downstream tasks is retained under powering.
  Revision: yes
- Referee: [§5] §5 (Experimental results): The SoTA claims on the ReSi benchmark and OOD evaluation provide no error bars, statistical significance tests, or data-exclusion criteria, making it impossible to assess whether the reported gains over baselines are robust or driven by specific dataset/model subsets.
  Authors: The referee correctly identifies a presentational gap. We will recompute all ReSi and OOD results over multiple random seeds, report means with standard-deviation error bars, add paired statistical tests (t-tests or Wilcoxon signed-rank) against baselines, and state explicitly that the full set of 14 architectures, 7 datasets, and all OOD splits were retained without exclusion.
  Revision: yes
- Referee: [§4.2] §4.2 (Alternating diffusion fusion): The construction of the fused operator from multiple layer Markov matrices requires a precise definition of the alternating diffusion procedure and a proof or empirical check that the resulting joint geometry remains faithful for cross-architecture comparison.
  Authors: We will rewrite §4.2 to give an algorithmic definition of the alternating-diffusion fusion (iterated application of layer-specific transition matrices with normalization at each step). In addition we will insert a controlled empirical check on synthetic data with known joint geometry, demonstrating that the fused operator recovers cross-layer correspondences more accurately than single-layer baselines and introduces no systematic distortion for cross-architecture use.
  Revision: yes
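The fusion step discussed in the last exchange can be sketched as a product of per-layer Markov matrices computed on the same samples, following the form F_AD = P(S_n) ... P(S_1) quoted elsewhere on this page. This is a minimal sketch under assumptions: the Gaussian-kernel RSM, the layer widths, and the names `gaussian_rsm` and `fuse_layers` are all hypothetical, and the paper's actual procedure may add steps not shown here.

```python
from functools import reduce

import numpy as np


def gaussian_rsm(X, bandwidth=1.0):
    """Hypothetical RSM: Gaussian kernel on pairwise squared distances (an assumption)."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.exp(-np.sum(diffs**2, axis=-1) / (2.0 * bandwidth**2))


def row_stochastic(S):
    """P = D^{-1} S for a nonnegative similarity matrix S."""
    return S / S.sum(axis=1, keepdims=True)


def fuse_layers(rsms):
    """Return P(S_n) @ ... @ P(S_1) for rsms = [S_1, ..., S_n] over the same samples."""
    markovs = [row_stochastic(S) for S in rsms]
    # Left-multiply by each later layer so that layer 1 ends up rightmost.
    return reduce(lambda acc, P: P @ acc, markovs)


rng = np.random.default_rng(0)
n_samples = 5
# Toy activations of three layers with different widths, same 5 samples.
layers = [rng.normal(size=(n_samples, width)) for width in (4, 8, 2)]
rsms = [gaussian_rsm(X) for X in layers]
fused = fuse_layers(rsms)
```

A product of row-stochastic matrices is itself row-stochastic, so in exact arithmetic no extra renormalization is needed for the fused operator to remain a Markov matrix; a per-step normalization as described in the rebuttal would mainly guard against numerical drift or serve a different kernel construction.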
Circularity Check
No circularity: equivalence is a direct mathematical rewriting, not a reduction by construction
The paper's central technical step is the observation that RSM-based similarities admit a closed-form rewrite using row-stochastic Markov matrices obtained via standard row-normalization. This is a definitional transformation (P = D^{-1}S or equivalent) that enables subsequent diffusion operators such as P^t and alternating diffusion; it does not derive a new quantity from fitted parameters or self-citations that then loops back to the original claim. The multi-scale CKA/DCor variants and network-to-network fusion are explicit extensions built on this rewrite, with the paper providing empirical validation on the ReSi benchmark rather than relying on tautological equivalence. No self-definitional loop, fitted-input prediction, or load-bearing self-citation is present in the derivation chain. The row-normalization step is a standard construction whose information-loss properties are a separate validity question, not evidence of circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Similarity measures based on representational similarity matrices admit a closed-form equivalent formulation in terms of row-stochastic Markov matrices.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Theorem 4.1 ... P(S) = 1q^T + α(S) C_q(S) ... m_RSM(S1,S2) = ψ(C_q(P(S1)), C_q(P(S2)))
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: multi-scale variants ... P(S)^t ... alternating-diffusion fusion map F_AD(A) = P(S_n) ... P(S_1)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.