pith. machine review for the scientific record.

arxiv: 2605.01107 · v1 · submitted 2026-05-01 · 💻 cs.LG · cond-mat.dis-nn · stat.ML

Recognition: unknown

Diffusion Operator Geometry of Feedforward Representations

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:32 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · stat.ML
keywords diffusion operator · feedforward representations · Bakry-Emery calculus · Mahalanobis separation · Gaussian class-conditional model · neural network geometry · Markov operator · representation learning

The pith

A Gaussian-kernel diffusion Markov operator on neural feature clouds yields closed-form class affinities and leakage controlled by pairwise Mahalanobis separations, with observables that vary smoothly under perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a diffusion-operator model for the geometry of representations learned by feedforward networks. Each snapshot of features induces a Gaussian-kernel Markov operator whose transport, spectral, boundary, and scale properties are recovered through Bakry-Emery calculus. In the balanced Gaussian class-conditional case with shared covariance these quantities admit explicit formulas governed by the regularized Mahalanobis distances between every pair of classes. The same operator observables change continuously when features are perturbed, whereas neighborhood-graph quantities can jump discontinuously. Synthetic and MNIST experiments confirm that the closed-form expressions hold and that the observables track training progress and width effects.
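To make the construction concrete, here is a minimal numpy sketch of a Gaussian-kernel Markov operator built from a feature cloud, with a coarse mixing gap and a cross-class leakage readout. The bandwidth rule, the row normalization, and the exact leakage definition are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def diffusion_operator(X, bandwidth):
    """Row-normalized Gaussian-kernel Markov operator on a feature cloud X of shape (n, d).

    `bandwidth` is the squared kernel scale; the paper's bandwidth rule may differ,
    this choice is purely illustrative.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / bandwidth)      # Gaussian kernel; W_ii = 1 on the diagonal
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic diffusion operator
    return P

def coarse_mixing_gap(P):
    """1 - |lambda_2|, where lambda_2 is the second-largest eigenvalue modulus of P."""
    evals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return float(1.0 - evals[1])

def cross_class_leakage(P, y):
    """Mean probability mass a point sends to other classes in one diffusion step.

    One natural reading of 'leakage'; the paper's exact definition may differ.
    """
    other = (y[:, None] != y[None, :]).astype(float)
    return float(np.mean(np.sum(P * other, axis=1)))

# Toy two-class Gaussian snapshot.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),
               rng.normal(3.0, 1.0, (100, 5))])
y = np.repeat([0, 1], 100)
P = diffusion_operator(X, bandwidth=2.0)
print(coarse_mixing_gap(P), cross_class_leakage(P, y))
```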

Core claim

The population operator induced by a balanced Gaussian class-conditional snapshot model with shared covariance possesses closed-form class affinities, leakage terms, and coarse spectra that are all functions of the pairwise regularized Mahalanobis separations c_ε^(a,b). The resulting operator observables vary smoothly under small changes to the underlying feature map, while corresponding hard neighborhood-graph diagnostics can change discontinuously.
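The paper's exact definition of c_ε^(a,b) is not reproduced on this page; one plausible form, assumed here purely for illustration, regularizes the shared covariance by εI before taking a Mahalanobis distance between class means.

```python
import numpy as np
from itertools import combinations

def regularized_mahalanobis(mus, Sigma, eps):
    """Pairwise separations between class means under a shared covariance Sigma.

    Assumed form: c_eps^(a,b) = (mu_a - mu_b)^T (Sigma + eps I)^(-1) (mu_a - mu_b).
    The paper's exact regularization may differ; this is an illustrative stand-in.
    """
    d = Sigma.shape[0]
    M = np.linalg.inv(Sigma + eps * np.eye(d))
    return {(a, b): float((mus[a] - mus[b]) @ M @ (mus[a] - mus[b]))
            for a, b in combinations(range(len(mus)), 2)}

# Three well-separated class means with an isotropic shared covariance.
mus = [np.zeros(3), np.array([2.0, 0.0, 0.0]), np.array([0.0, 3.0, 0.0])]
print(regularized_mahalanobis(mus, np.eye(3), eps=0.5))
# In the claimed bridge, larger c_eps^(a,b) corresponds to lower cross-class
# leakage and a larger coarse mixing gap.
```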

What carries the argument

The Gaussian-kernel diffusion Markov operator induced by each feature-cloud snapshot, from which transport, spectral, label-boundary, and local-scale observables are extracted via Bakry-Emery Γ-calculus.
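A discrete version of one such Γ-calculus readout can be sketched as follows; the generator L = P − I, the time scaling, and the choice of test function g are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def carre_du_champ(P, f, g):
    """Discrete carre du champ Gamma(f, g) = 0.5 * (L(fg) - f*(Lg) - g*(Lf)),
    with generator L = P - I. The paper's time scaling may differ."""
    L = P - np.eye(P.shape[0])
    return 0.5 * (L @ (f * g) - f * (L @ g) - g * (L @ f))

def mean_label_energy(P, y, label):
    """E[Gamma(g, g)] for the indicator g of one class, averaged under the
    stationary distribution of P -- a label-boundary style observable."""
    g = (y == label).astype(float)
    gamma = carre_du_champ(P, g, g)
    evals, evecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]))
    pi /= pi.sum()                         # stationary distribution of P
    return float(pi @ gamma)

# Tiny two-class example on a hand-written row-stochastic P.
P = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.2, 0.7, 0.0, 0.1],
              [0.1, 0.0, 0.7, 0.2],
              [0.0, 0.1, 0.2, 0.7]])
y = np.array([0, 0, 1, 1])
print(mean_label_energy(P, y, label=0))
```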

Load-bearing premise

Real feedforward representations can be usefully approximated by balanced Gaussian class-conditional distributions that share a common covariance matrix.

What would settle it

Empirical observation that the operator-derived observables exhibit discontinuous jumps under continuous small perturbations to the features of a trained network, or systematic mismatch between the predicted closed-form affinities and values measured on synthetic Gaussian data.
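A perturbation experiment of that kind could look like the following sketch; the noise scale, the choice of k, and the hard-graph diagnostic are all illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def knn_cross_edges(X, y, k):
    """Fraction of directed k-NN edges that cross class labels (a hard-graph diagnostic)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]
    return float(np.mean(y[nbrs] != y[:, None]))

def perturbation_spread(X, y, observable, sigma=0.01, trials=20, seed=0):
    """Standard deviation of an observable across small Gaussian perturbations of X."""
    rng = np.random.default_rng(seed)
    vals = [observable(X + sigma * rng.normal(size=X.shape), y) for _ in range(trials)]
    return float(np.std(vals))

# With diffusion_operator and cross_class_leakage as in the earlier sketch:
#   soft = perturbation_spread(X, y, lambda X, y: cross_class_leakage(diffusion_operator(X, 2.0), y))
#   hard = perturbation_spread(X, y, lambda X, y: knn_cross_edges(X, y, k=10))
# The paper's stability claim predicts soft varying much less than hard; the
# opposite pattern, or discontinuous jumps in soft, would count against it.
```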

Figures

Figures reproduced from arXiv: 2605.01107 by Kanishka Reddy.

Figure 1. Gaussian bridge validation. Theory versus empirical measurements for balanced Gaussian class-conditionals. Left: mean pairwise separation c̄_ε. Middle: cross-class leakage. Right: coarse mixing gap 1 − λ_2. The dashed line is the identity. The bridge closely predicts the finite-sample operator observables across separations and bandwidths. Diagonal kernel entries W_ii = 1. The population formulas in Section 4… view at source ↗
Figure 2. Training-time evolution of operator geometry. Top: across hidden layers, training increases mean separation c̄_ε, reduces cross-class leakage, decreases the coarse mixing gap, and contracts the soft diffusion radius. Bandwidths are fixed from initialization, so late sharp drops reflect features moving beyond the initial diffusion scale. Bottom: the full separation matrix in the deepest hidden layer at selec… view at source ↗
Figure 3. Width dependence of scale-normalized operator observables. Each feature cloud is centered and divided by its RMS feature norm before operator construction, and all widths use a common reference bandwidth. Increasing width raises mean class separation c̄_ε and the weakest pairwise separation c_min = min_{a≠b} c_ε^(a,b), while cross-class leakage decreases modestly and the coarse mixing gap changes weakly. Shad… view at source ↗
Figure 4. Operator observables are more stable than hard-graph observables. A learned feature cloud is repeatedly perturbed with small Gaussian noise. Top: absolute standard deviation across perturbations. Bottom: relative standard deviation. Error bars show one standard deviation across perturbations. Operator leakage, E[Γ(g, g)], and soft radius vary substantially less than their hard k-NN graph counterparts. edge… view at source ↗
read the original abstract

Neural networks transform data through learned representations whose geometry affects separation, contraction, and generalization. Recent work studies this geometry using discrete curvature on neighborhood graphs, suggesting Ricci-flow-like behavior across layers. We develop a smooth operator-theoretic alternative for feedforward representation snapshots. Each feature cloud induces a Gaussian-kernel diffusion Markov operator, and transport, spectral, label-boundary, and local-scale observables are derived from this single object via Bakry-Emery $\Gamma$-calculus. In a balanced Gaussian class-conditional snapshot model with shared covariance, the population operator has closed-form class affinities, leakage, and coarse spectra, all controlled by pairwise regularized Mahalanobis separations $c_\varepsilon^{(a,b)}$. We also prove that the resulting operator observables vary smoothly under feature perturbations, while hard neighborhood-graph diagnostics can change discontinuously. Synthetic experiments validate the closed-form Gaussian bridge, while learned MNIST experiments show that the same operator observables track training, width, and perturbation stability. Together, these results give a stable operator-geometric framework for analyzing feedforward representation geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript develops a diffusion operator framework for studying the geometry of feedforward neural network representations. Each snapshot of features induces a Gaussian-kernel diffusion Markov operator, from which observables for transport, spectra, label boundaries, and local scales are derived using Bakry-Emery Γ-calculus. Under a balanced Gaussian class-conditional model with shared covariance, closed-form expressions are obtained for class affinities, leakage, and coarse spectra, parameterized by regularized Mahalanobis separations c_ε^{(a,b)}. The paper proves that these observables change smoothly under feature perturbations, in contrast to discontinuous changes in neighborhood-graph based diagnostics. Synthetic experiments confirm the closed-form results under the model, and MNIST experiments illustrate that the observables can track aspects of training, network width, and stability to perturbations.

Significance. Should the central derivations and the smoothness proof be correct, this work contributes a theoretically motivated smooth operator-based approach to representation geometry analysis, offering an alternative to discrete curvature methods on graphs. The closed-form results under the Gaussian model provide a direct link to Mahalanobis geometry, which could aid interpretability. The emphasis on smoothness and stability is a notable strength, and the combination of synthetic validation with real-data application is positive. This could be significant for the field of representation learning and geometric analysis of neural networks.

major comments (1)
  1. [Gaussian model derivation and MNIST experiments] The closed-form class affinities, leakage, and spectra are derived specifically under the balanced Gaussian class-conditional snapshot model with shared covariance (as stated in the abstract). However, the MNIST experiments section applies the general operator observables to learned representations from neural networks without providing any empirical checks (e.g., for joint Gaussianity, shared covariance, or calibration of the Mahalanobis separations c_ε^{(a,b)}) to confirm that the modeling assumptions hold for the real feature clouds. This is a load-bearing issue for the interpretability claims, as deviations from the model would invalidate the specific closed-form controls even if the diffusion operator remains applicable.
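As one illustration of the kind of check the comment asks for (the statistic is illustrative and not taken from the paper), a shared-covariance diagnostic could compare per-class covariances against the pooled covariance of the learned features:

```python
import numpy as np

def covariance_homogeneity(X, y):
    """Relative Frobenius spread of per-class covariances around the pooled covariance.

    Values near zero are consistent with the shared-covariance assumption; what
    counts as 'close enough' is a judgment call, not a threshold from the paper.
    """
    classes = np.unique(y)
    covs = [np.cov(X[y == c].T) for c in classes]
    pooled = np.mean(covs, axis=0)
    devs = [np.linalg.norm(C - pooled, ord='fro') for C in covs]
    return float(np.mean(devs) / np.linalg.norm(pooled, ord='fro'))
```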
minor comments (2)
  1. The abstract mentions 'coarse spectra' without a precise definition; this should be clarified in the main text near the relevant equations.
  2. [Notation] Ensure that the regularization parameter ε is consistently defined and its role in the Mahalanobis separation is explained early in the paper.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation of the manuscript's potential contributions and for the detailed comment on the scope of the Gaussian modeling assumptions. We address the concern point by point below, clarifying the separation between the model-specific closed forms and the general operator framework applied to MNIST.

read point-by-point responses
  1. Referee: The closed-form class affinities, leakage, and spectra are derived specifically under the balanced Gaussian class-conditional snapshot model with shared covariance (as stated in the abstract). However, the MNIST experiments section applies the general operator observables to learned representations from neural networks without providing any empirical checks (e.g., for joint Gaussianity, shared covariance, or calibration of the Mahalanobis separations c_ε^{(a,b)}) to confirm that the modeling assumptions hold for the real feature clouds. This is a load-bearing issue for the interpretability claims, as deviations from the model would invalidate the specific closed-form controls even if the diffusion operator remains applicable.

    Authors: We agree that the closed-form expressions for class affinities, leakage, and coarse spectra (Section 3) are derived specifically under the balanced Gaussian class-conditional snapshot model with shared covariance and are validated only in the synthetic experiments (Section 4.1). The MNIST experiments (Section 4.2) instead apply the general diffusion operator observables—transport maps, spectral quantities, label-boundary measures, and local scales—obtained via Bakry-Emery Γ-calculus from the Gaussian-kernel Markov operator on arbitrary feature clouds. These general observables are defined without reference to the Gaussian class-conditional assumption or the regularized Mahalanobis parameters c_ε^{(a,b)}. The MNIST results report how these observables evolve with training, vary with network width, and respond to perturbations; they do not invoke or rely on the closed-form controls. Consequently, any deviation from Gaussianity in the MNIST feature clouds does not affect the validity of the reported observations or the smoothness proof (which holds for the general operator). To prevent misreading, we will insert a short clarifying paragraph at the start of Section 4.2 explicitly distinguishing the two uses of the framework. We do not view empirical Gaussianity checks as necessary for the general-operator claims, though we acknowledge the referee's point that such checks could further strengthen interpretability if added in future work. revision: partial

Circularity Check

0 steps flagged

No circularity: closed forms are direct consequences of stated Gaussian model assumptions

full rationale

The paper explicitly derives closed-form class affinities, leakage, and spectra from the balanced Gaussian class-conditional snapshot model with shared covariance, expressing them in terms of regularized Mahalanobis separations c_ε^(a,b). This is a standard mathematical reduction under the model, not a fit to data followed by a renamed prediction. Synthetic experiments are used only to verify that the derived expressions match the assumed generative process. The smoothness result under perturbations is proved separately via Bakry-Emery calculus on the operator and does not rely on the Gaussian closed forms. Application to MNIST uses the general (non-closed-form) observables without asserting that real representations satisfy the modeling assumptions. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems from the same authors appear as load-bearing steps. The derivation chain is therefore self-contained against its explicit premises.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on modeling feature clouds as balanced Gaussian class-conditionals with shared covariance and on standard properties of diffusion operators and Bakry-Emery calculus.

free parameters (1)
  • regularization parameter ε
    Appears in the definition of regularized Mahalanobis separations c_ε that control the closed-form observables.
axioms (2)
  • domain assumption Feature snapshots are well-approximated by balanced Gaussian class-conditional distributions sharing a common covariance matrix.
    Invoked to obtain closed-form expressions for class affinities, leakage, and spectra.
  • standard math The Gaussian-kernel diffusion Markov operator and Bakry-Emery Γ-calculus apply directly to finite feature clouds.
    Background operator theory used to derive transport, spectral, and local-scale observables.

pith-pipeline@v0.9.0 · 5474 in / 1419 out tokens · 69522 ms · 2026-05-09T19:32:21.223351+00:00 · methodology

discussion (0)

