
arxiv: 2605.15901 · v1 · pith:Y24ZAXAKnew · submitted 2026-05-15 · 💻 cs.LG

From Layers to Networks: Comparing Neural Representations via Diffusion Geometry

Pith reviewed 2026-05-20 19:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords representational similarity matrices · Markov transition matrices · multi-scale geometry · layer fusion · neural network comparison · diffusion operators · centered kernel alignment

The pith

Representational similarity matrices admit an exact rewrite as row-stochastic Markov matrices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a wide range of measures for comparing neural representations, built on pairwise similarity matrices, possess an exact alternative expression using row-stochastic Markov matrices. These matrices encode random-walk transitions and thereby let the same measures examine representation geometry at different scales simply by raising the matrix to successive powers. The same rewrite further permits the matrices from several layers to be combined through alternating diffusion into one operator that reflects the network's overall sample geometry. A sympathetic reader would care because the change moves comparisons from isolated layers toward whole-network descriptions and supplies adjustable resolution without redesigning the underlying similarity function.

Core claim

The central claim is that a broad class of similarity measures based on representational similarity matrices admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices. This equivalence permits multi-scale variants of measures such as Centered Kernel Alignment and Distance Correlation that use the t-th power of the transition matrix to examine geometry at chosen diffusion scales. It also supports fused variants in which the Markov matrices of multiple layers are merged via alternating diffusion into a single operator that captures the network's joint sample geometry, thereby shifting the comparison task from layer-to-layer to network-to-network.

What carries the argument

The row-stochastic Markov matrix obtained directly from a representational similarity matrix, serving as the transition operator that encodes geometry for diffusion at adjustable scales.
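
To make the operator concrete, here is a minimal NumPy sketch of row normalization, P = D^{-1}S, where D is the diagonal matrix of row sums of S. The Gaussian-kernel RSM, the `bandwidth` parameter, and the helper names `gaussian_rsm` and `rsm_to_markov` are illustrative assumptions; the paper may build its similarity matrices differently.

```python
import numpy as np

def gaussian_rsm(X, bandwidth=1.0):
    """Illustrative RSM (an assumption): Gaussian kernel on pairwise distances."""
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def rsm_to_markov(S):
    """Row-normalize a non-negative similarity matrix: P = D^{-1} S."""
    S = np.asarray(S, dtype=float)
    if np.any(S < 0):
        raise ValueError("similarities must be non-negative")
    degrees = S.sum(axis=1)  # diagonal of D
    if np.any(degrees <= 0):
        raise ValueError("every row needs positive total similarity")
    return S / degrees[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # toy activations: 6 samples, 4 features
S = gaussian_rsm(X)
P = rsm_to_markov(S)         # each row is a probability distribution
```

Each row of `P` gives the one-step transition probabilities of a random walk over the samples. In general `P` is not symmetric even when `S` is.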

If this is right

  • Existing similarity measures acquire multi-scale versions that examine representation geometry at chosen diffusion times by matrix powering.
  • Multiple layers can be fused into one operator that encodes their joint sample geometry for a single network-level comparison.
  • Comparisons move from matching individual layers to matching entire networks as unified objects.
  • The same operators apply unchanged to out-of-distribution data without requiring new similarity definitions.
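
The first consequence can be sketched as follows, under stated assumptions. The score below is a CKA-style normalized inner product of doubly centered matrices, applied to the t-th powers of two transition matrices. The Gaussian kernel, the `bandwidth` default, and the names `markov_from_points`, `centered_alignment`, and `multiscale_alignment` are hypothetical, and the paper's exact multi-scale formula may differ.

```python
import numpy as np

def markov_from_points(X, bandwidth=2.0):
    """Gaussian-kernel similarities, row-normalized to a transition matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-sq / (2.0 * bandwidth ** 2))
    return S / S.sum(axis=1, keepdims=True)

def centered_alignment(A, B):
    """Normalized Frobenius inner product of doubly centered matrices.
    For symmetric Gram matrices this reduces to the usual CKA formula."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    Ac, Bc = H @ A @ H, H @ B @ H
    return float((Ac * Bc).sum() / (np.linalg.norm(Ac) * np.linalg.norm(Bc)))

def multiscale_alignment(X, Y, t, bandwidth=2.0):
    """Compare two representations at diffusion scale t (a positive integer)."""
    Pt = np.linalg.matrix_power(markov_from_points(X, bandwidth), t)
    Qt = np.linalg.matrix_power(markov_from_points(Y, bandwidth), t)
    return centered_alignment(Pt, Qt)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Y_rot = X @ R                                 # same geometry, rotated
Y_rand = rng.normal(size=(20, 5))             # unrelated representation

for t in (1, 2, 4):
    print(t, multiscale_alignment(X, Y_rot, t), multiscale_alignment(X, Y_rand, t))
```

A rotation preserves pairwise distances, so the rotated copy should score 1 up to round-off at every scale. For very large t the powers approach a rank-one matrix whose centered form vanishes, so the ratio becomes numerically unstable. The sketch does not guard against that.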

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The Markov reformulation could be checked on similarity measures outside the two explicitly extended in the work.
  • Fused operators might be used to track how geometric structure evolves with network depth in a single computation.
  • The approach suggests a route to comparing networks trained on different tasks by aligning their combined diffusion geometries rather than single-layer snapshots.

Load-bearing premise

Converting a representational similarity matrix into its row-stochastic Markov form preserves the geometric relations needed for meaningful multi-scale and cross-layer comparisons.
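
One elementary check bears on this premise. It is a standard property of reversible Markov chains, not a result stated in the paper. When S is symmetric with strictly positive entries, the chain P = D^{-1}S has a unique stationary distribution proportional to the row sums of S, and multiplying P row-wise by that distribution recovers S up to a global scale. In that restricted setting row normalization loses only the overall scale of S. Whether a given measure actually uses the stationary distribution is a separate question. The toy matrix and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 5))
S = (A + A.T) / 2.0   # symmetric, strictly positive toy similarity matrix
d = S.sum(axis=1)     # degrees (row sums)
P = S / d[:, None]    # row-stochastic transition matrix

# Stationary distribution computed from P alone:
# the left eigenvector of P for the eigenvalue closest to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()    # normalize to sum to 1 (this also fixes the sign)

S_rescaled = pi[:, None] * P  # diag(pi) @ P, equal to S / S.sum()
```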

What would settle it

If the multi-scale or fused Markov-based measures produce systematically lower agreement with human or task-based judgments than the original similarity measures when tested on networks whose representational differences are independently known, the claim that the conversion preserves essential geometry would be refuted.

Original abstract

Diffusion geometry is a manifold learning framework that uses random walks defined by Markov transition matrices to characterize the geometry of a dataset at multiple scales. We use diffusion geometry for neural representations, incorporating tools from multi-view learning into this field for the first time. Our key technical observation is that a broad class of similarity measures based on representational similarity matrices (RSMs) admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices, opening the door to manipulations from diffusion geometry. As a first application, we develop multi-scale variants of Centered Kernel Alignment and Distance Correlation, which utilise the $t^{th}$ power of the underlying transition matrix to probe the data geometry at adjustable diffusion scales. Going further, we introduce variants of these measures which fuse the Markov matrices of several layers via alternating diffusion into a single operator that captures the network's joint sample geometry, allowing similarity to be computed across multiple layers and shifting the comparison from layer-to-layer to network-to-network. We perform extensive numerical experiments, evaluating our measures on the Representational Similarity (ReSi) benchmark comprising 14 architectures trained on 7 datasets across three different domains. Our methods achieve SoTA results in accuracy and output correlation for both language and vision tasks across different models. We furthermore show SoTA performance on an additional benchmark evaluating on out-of-distribution data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a broad class of similarity measures based on representational similarity matrices (RSMs) admits a closed-form equivalent formulation in terms of row-stochastic Markov matrices. This enables multi-scale variants of Centered Kernel Alignment (CKA) and Distance Correlation using the t-th power of the transition matrix to probe geometry at adjustable scales, and layer fusion via alternating diffusion into a single operator for network-to-network comparisons. Experiments on the ReSi benchmark (14 architectures, 7 datasets, 3 domains) report state-of-the-art accuracy and output correlation for vision and language tasks, plus strong performance on out-of-distribution data.

Significance. If the equivalence derivation is exact and the row-stochastic normalization preserves task-relevant geometric structure, the framework would provide a principled way to extend RSM-based measures to multi-scale analysis and cross-network fusion, potentially improving robustness over single-scale CKA or Distance Correlation (DCor). The reported SoTA results on a broad benchmark suggest empirical utility, provided the claims are reproducible and normalization artifacts are shown to be negligible.

major comments (3)
  1. [§3] §3 (Derivation of equivalence): The central claim that RSM similarities admit an exact closed-form rewrite as a row-stochastic Markov matrix P (via row-normalization of S) must be accompanied by an explicit demonstration that this step does not discard sample-degree or density information that encodes task-relevant geometry; otherwise the subsequent multi-scale (P^t) and alternating-diffusion constructions rest on an unverified assumption.
  2. [§5] §5 (Experimental results): The SoTA claims on the ReSi benchmark and OOD evaluation provide no error bars, statistical significance tests, or data-exclusion criteria, making it impossible to assess whether the reported gains over baselines are robust or driven by specific dataset/model subsets.
  3. [§4.2] §4.2 (Alternating diffusion fusion): The construction of the fused operator from multiple layer Markov matrices requires a precise definition of the alternating diffusion procedure and a proof or empirical check that the resulting joint geometry remains faithful for cross-architecture comparison.
minor comments (2)
  1. [Notation] Clarify notation for the transition matrix P and its exact relation to the original similarity matrix S in all equations and pseudocode.
  2. [Methods] Add a brief discussion of computational complexity for the diffusion powers and alternating diffusion steps when applied to large representation matrices.
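
On the complexity point in minor comment 2, a hedged sketch of one standard option from the diffusion-maps literature follows. Dense powering of an n × n matrix by repeated squaring costs O(n^3 log t). If S is symmetric, then P = D^{-1}S is similar to the symmetric matrix A = D^{-1/2} S D^{-1/2}, so a single eigendecomposition of A gives P^t = D^{-1/2} V Λ^t V^T D^{1/2} for every t. Keeping only the leading k eigenpairs would reduce each reconstruction to O(n^2 k). Nothing here is taken from the paper; the function names `diffusion_spectrum` and `markov_power` are hypothetical.

```python
import numpy as np

def diffusion_spectrum(S):
    """Eigendecompose A = D^{-1/2} S D^{-1/2} for a symmetric, non-negative S."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    A = S * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]
    eigvals, V = np.linalg.eigh(A)  # A is symmetric, so eigh applies
    return eigvals, V, d

def markov_power(eigvals, V, d, t):
    """Return P^t = D^{-1/2} V diag(eigvals^t) V^T D^{1/2} for integer t >= 1."""
    sqrt_d = np.sqrt(d)
    core = (V * eigvals ** t) @ V.T  # V diag(eigvals^t) V^T
    return core * (1.0 / sqrt_d)[:, None] * sqrt_d[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
S = np.exp(-sq / 2.0)               # toy Gaussian-kernel RSM (an assumption)
eigvals, V, d = diffusion_spectrum(S)
P = S / d[:, None]
P3 = markov_power(eigvals, V, d, 3)  # matches np.linalg.matrix_power(P, 3)
```

The decomposition is computed once and reused across scales, which matters when a measure is evaluated at many values of t.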

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and planned revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Derivation of equivalence): The central claim that RSM similarities admit an exact closed-form rewrite as a row-stochastic Markov matrix P (via row-normalization of S) must be accompanied by an explicit demonstration that this step does not discard sample-degree or density information that encodes task-relevant geometry; otherwise the subsequent multi-scale (P^t) and alternating-diffusion constructions rest on an unverified assumption.

    Authors: We agree that explicit verification is warranted. In the revision we will expand §3 with a derivation showing that row-normalization P = D^{-1}S (D the diagonal degree matrix) is the standard construction from diffusion maps, which incorporates local sample density rather than discarding it. The resulting transition probabilities preserve intrinsic geometry while correcting for non-uniform sampling; we will include a brief analytic argument and a low-dimensional example confirming that cluster structure relevant to downstream tasks is retained under powering. revision: yes

  2. Referee: [§5] §5 (Experimental results): The SoTA claims on the ReSi benchmark and OOD evaluation provide no error bars, statistical significance tests, or data-exclusion criteria, making it impossible to assess whether the reported gains over baselines are robust or driven by specific dataset/model subsets.

    Authors: The referee correctly identifies a presentational gap. We will recompute all ReSi and OOD results over multiple random seeds, report means with standard-deviation error bars, add paired statistical tests (t-tests or Wilcoxon signed-rank) against baselines, and state explicitly that the full set of 14 architectures, 7 datasets, and all OOD splits were retained without exclusion. revision: yes

  3. Referee: [§4.2] §4.2 (Alternating diffusion fusion): The construction of the fused operator from multiple layer Markov matrices requires a precise definition of the alternating diffusion procedure and a proof or empirical check that the resulting joint geometry remains faithful for cross-architecture comparison.

    Authors: We will rewrite §4.2 to give an algorithmic definition of the alternating-diffusion fusion (iterated application of layer-specific transition matrices with normalization at each step). In addition we will insert a controlled empirical check on synthetic data with known joint geometry, demonstrating that the fused operator recovers cross-layer correspondences more accurately than single-layer baselines and introduces no systematic distortion for cross-architecture use. revision: yes
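
The fusion described in the third response ("iterated application of layer-specific transition matrices with normalization at each step") could look roughly like the sketch below. It is one plausible reading rather than the authors' algorithm. The Gaussian kernel, the layer ordering, and the names `gaussian_markov` and `alternating_diffusion` are assumptions. A product of row-stochastic matrices is itself row-stochastic, so the renormalization only guards against floating-point drift.

```python
import numpy as np
from functools import reduce

def row_normalize(M):
    """Rescale each row to sum to 1."""
    return M / M.sum(axis=1, keepdims=True)

def gaussian_markov(X, bandwidth=1.0):
    """Per-layer transition matrix from a Gaussian kernel (illustrative choice)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return row_normalize(np.exp(-sq / (2.0 * bandwidth ** 2)))

def alternating_diffusion(markov_matrices):
    """Fuse per-layer transition matrices via their ordered product,
    renormalizing rows after each multiplication."""
    def step(acc, P):
        return row_normalize(acc @ P)
    return reduce(step, markov_matrices)

rng = np.random.default_rng(0)
# The same 10 samples observed at three toy layers of different widths.
layers = [rng.normal(size=(10, width)) for width in (4, 8, 16)]
fused = alternating_diffusion([gaussian_markov(X) for X in layers])
```

Matrix products do not commute, so the fused operator depends on the order in which layers are applied. A precise definition in §4.2 would need to fix that order.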

Circularity Check

0 steps flagged

No circularity: equivalence is a direct mathematical rewriting, not a reduction by construction

Full rationale

The paper's central technical step is the observation that RSM-based similarities admit a closed-form rewrite using row-stochastic Markov matrices obtained via standard row-normalization. This is a definitional transformation (P = D^{-1}S or equivalent) that enables subsequent diffusion operators such as P^t and alternating diffusion; it does not derive a new quantity from fitted parameters or self-citations that then loops back to the original claim. The multi-scale CKA/DCor variants and network-to-network fusion are explicit extensions built on this rewrite, with the paper providing empirical validation on the ReSi benchmark rather than relying on tautological equivalence. No self-definitional loop, fitted-input prediction, or load-bearing self-citation is present in the derivation chain. The row-normalization step is a standard construction whose information-loss properties are a separate validity question, not evidence of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the mathematical equivalence between RSM-based similarities and row-stochastic Markov matrices plus the validity of diffusion powers and alternating diffusion for capturing multi-scale and joint geometry; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption: Similarity measures based on representational similarity matrices admit a closed-form equivalent formulation in terms of row-stochastic Markov matrices.
    This equivalence is the key technical observation that enables all subsequent diffusion-geometry manipulations.

pith-pipeline@v0.9.0 · 5765 in / 1274 out tokens · 33691 ms · 2026-05-20T19:58:01.001240+00:00 · methodology


