pith. machine review for the scientific record.

arxiv: 2605.01107 · v1 · submitted 2026-05-01 · 💻 cs.LG · cond-mat.dis-nn · stat.ML

Recognition: unknown

Diffusion Operator Geometry of Feedforward Representations

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:32 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · stat.ML
keywords diffusion operator · feedforward representations · Bakry-Emery calculus · Mahalanobis separation · Gaussian class-conditional model · neural network geometry · Markov operator · representation learning

The pith

A Gaussian-kernel diffusion Markov operator on neural feature clouds yields closed-form class affinities and leakage controlled by pairwise Mahalanobis separations, with observables that vary smoothly under perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a diffusion-operator model for the geometry of representations learned by feedforward networks. Each snapshot of features induces a Gaussian-kernel Markov operator whose transport, spectral, boundary, and scale properties are recovered through Bakry-Emery calculus. In the balanced Gaussian class-conditional case with shared covariance these quantities admit explicit formulas governed by the regularized Mahalanobis distances between every pair of classes. The same operator observables change continuously when features are perturbed, whereas neighborhood-graph quantities can jump discontinuously. Synthetic and MNIST experiments confirm that the closed-form expressions hold and that the observables track training progress and width effects.
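To make the construction concrete, here is a minimal numpy sketch of a Gaussian-kernel Markov operator built from a feature cloud, with a coarse mixing gap and a cross-class leakage readout. The bandwidth rule, the row normalization, and the exact leakage definition are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def diffusion_operator(X, bandwidth):
    """Row-normalized Gaussian-kernel Markov operator on a feature cloud X of shape (n, d).

    `bandwidth` is the squared kernel scale; the paper's bandwidth rule may differ,
    this choice is purely illustrative.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / bandwidth)      # Gaussian kernel; W_ii = 1 on the diagonal
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic diffusion operator
    return P

def coarse_mixing_gap(P):
    """1 - |lambda_2|, where lambda_2 is the second-largest eigenvalue modulus of P."""
    evals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return float(1.0 - evals[1])

def cross_class_leakage(P, y):
    """Mean probability mass a point sends to other classes in one diffusion step.

    One natural reading of 'leakage'; the paper's exact definition may differ.
    """
    other = (y[:, None] != y[None, :]).astype(float)
    return float(np.mean(np.sum(P * other, axis=1)))

# Toy two-class Gaussian snapshot.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),
               rng.normal(3.0, 1.0, (100, 5))])
y = np.repeat([0, 1], 100)
P = diffusion_operator(X, bandwidth=2.0)
print(coarse_mixing_gap(P), cross_class_leakage(P, y))
```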

Core claim

The population operator induced by a balanced Gaussian class-conditional snapshot model with shared covariance possesses closed-form class affinities, leakage terms, and coarse spectra that are all functions of the pairwise regularized Mahalanobis separations c_ε^(a,b). The resulting operator observables vary smoothly under small changes to the underlying feature map, while corresponding hard neighborhood-graph diagnostics can change discontinuously.
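The paper's exact definition of c_ε^(a,b) is not reproduced on this page; one plausible form, assumed here purely for illustration, regularizes the shared covariance by εI before taking a Mahalanobis distance between class means.

```python
import numpy as np
from itertools import combinations

def regularized_mahalanobis(mus, Sigma, eps):
    """Pairwise separations between class means under a shared covariance Sigma.

    Assumed form: c_eps^(a,b) = (mu_a - mu_b)^T (Sigma + eps I)^(-1) (mu_a - mu_b).
    The paper's exact regularization may differ; this is an illustrative stand-in.
    """
    d = Sigma.shape[0]
    M = np.linalg.inv(Sigma + eps * np.eye(d))
    return {(a, b): float((mus[a] - mus[b]) @ M @ (mus[a] - mus[b]))
            for a, b in combinations(range(len(mus)), 2)}

# Three well-separated class means with an isotropic shared covariance.
mus = [np.zeros(3), np.array([2.0, 0.0, 0.0]), np.array([0.0, 3.0, 0.0])]
print(regularized_mahalanobis(mus, np.eye(3), eps=0.5))
# In the claimed bridge, larger c_eps^(a,b) corresponds to lower cross-class
# leakage and a larger coarse mixing gap.
```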

What carries the argument

The Gaussian-kernel diffusion Markov operator induced by each feature-cloud snapshot, from which transport, spectral, label-boundary, and local-scale observables are extracted via Bakry-Emery Γ-calculus.
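A discrete version of one such Γ-calculus readout can be sketched as follows; the generator L = P − I, the time scaling, and the choice of test function g are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def carre_du_champ(P, f, g):
    """Discrete carre du champ Gamma(f, g) = 0.5 * (L(fg) - f*(Lg) - g*(Lf)),
    with generator L = P - I. The paper's time scaling may differ."""
    L = P - np.eye(P.shape[0])
    return 0.5 * (L @ (f * g) - f * (L @ g) - g * (L @ f))

def mean_label_energy(P, y, label):
    """E[Gamma(g, g)] for the indicator g of one class, averaged under the
    stationary distribution of P -- a label-boundary style observable."""
    g = (y == label).astype(float)
    gamma = carre_du_champ(P, g, g)
    evals, evecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]))
    pi /= pi.sum()                         # stationary distribution of P
    return float(pi @ gamma)

# Tiny two-class example on a hand-written row-stochastic P.
P = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.2, 0.7, 0.0, 0.1],
              [0.1, 0.0, 0.7, 0.2],
              [0.0, 0.1, 0.2, 0.7]])
y = np.array([0, 0, 1, 1])
print(mean_label_energy(P, y, label=0))
```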

Load-bearing premise

Real feedforward representations can be usefully approximated by balanced Gaussian class-conditional distributions that share a common covariance matrix.

What would settle it

Empirical observation that the operator-derived observables exhibit discontinuous jumps under continuous small perturbations to the features of a trained network, or systematic mismatch between the predicted closed-form affinities and values measured on synthetic Gaussian data.
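A perturbation experiment of that kind could look like the following sketch; the noise scale, the choice of k, and the hard-graph diagnostic are all illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def knn_cross_edges(X, y, k):
    """Fraction of directed k-NN edges that cross class labels (a hard-graph diagnostic)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]
    return float(np.mean(y[nbrs] != y[:, None]))

def perturbation_spread(X, y, observable, sigma=0.01, trials=20, seed=0):
    """Standard deviation of an observable across small Gaussian perturbations of X."""
    rng = np.random.default_rng(seed)
    vals = [observable(X + sigma * rng.normal(size=X.shape), y) for _ in range(trials)]
    return float(np.std(vals))

# With diffusion_operator and cross_class_leakage as in the earlier sketch:
#   soft = perturbation_spread(X, y, lambda X, y: cross_class_leakage(diffusion_operator(X, 2.0), y))
#   hard = perturbation_spread(X, y, lambda X, y: knn_cross_edges(X, y, k=10))
# The paper's stability claim predicts soft varying much less than hard; the
# opposite pattern, or discontinuous jumps in soft, would count against it.
```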

Figures

Figures reproduced from arXiv: 2605.01107 by Kanishka Reddy.

Figure 1. Gaussian bridge validation. Theory versus empirical measurements for balanced Gaussian class-conditionals. Left: mean pairwise separation c̄_ε. Middle: cross-class leakage. Right: coarse mixing gap 1 − λ_2. The dashed line is the identity. The bridge closely predicts the finite-sample operator observables across separations and bandwidths. Diagonal kernel entries W_ii = 1. The population formulas in Section 4… view at source ↗
Figure 2. Training-time evolution of operator geometry. Top: across hidden layers, training increases mean separation c̄_ε, reduces cross-class leakage, decreases the coarse mixing gap, and contracts the soft diffusion radius. Bandwidths are fixed from initialization, so late sharp drops reflect features moving beyond the initial diffusion scale. Bottom: the full separation matrix in the deepest hidden layer at selec… view at source ↗
Figure 3. Width dependence of scale-normalized operator observables. Each feature cloud is centered and divided by its RMS feature norm before operator construction, and all widths use a common reference bandwidth. Increasing width raises mean class separation c̄_ε and the weakest pairwise separation c_min = min_{a≠b} c_ε^(a,b), while cross-class leakage decreases modestly and the coarse mixing gap changes weakly. Shad… view at source ↗
Figure 4. Operator observables are more stable than hard-graph observables. A learned feature cloud is repeatedly perturbed with small Gaussian noise. Top: absolute standard deviation across perturbations. Bottom: relative standard deviation. Error bars show one standard deviation across perturbations. Operator leakage, E[Γ(g, g)], and soft radius vary substantially less than their hard k-NN graph counterparts. edge… view at source ↗
read the original abstract

Neural networks transform data through learned representations whose geometry affects separation, contraction, and generalization. Recent work studies this geometry using discrete curvature on neighborhood graphs, suggesting Ricci-flow-like behavior across layers. We develop a smooth operator-theoretic alternative for feedforward representation snapshots. Each feature cloud induces a Gaussian-kernel diffusion Markov operator, and transport, spectral, label-boundary, and local-scale observables are derived from this single object via Bakry-Emery $\Gamma$-calculus. In a balanced Gaussian class-conditional snapshot model with shared covariance, the population operator has closed-form class affinities, leakage, and coarse spectra, all controlled by pairwise regularized Mahalanobis separations $c_\varepsilon^{(a,b)}$. We also prove that the resulting operator observables vary smoothly under feature perturbations, while hard neighborhood-graph diagnostics can change discontinuously. Synthetic experiments validate the closed-form Gaussian bridge, while learned MNIST experiments show that the same operator observables track training, width, and perturbation stability. Together, these results give a stable operator-geometric framework for analyzing feedforward representation geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript develops a diffusion operator framework for studying the geometry of feedforward neural network representations. Each snapshot of features induces a Gaussian-kernel diffusion Markov operator, from which observables for transport, spectra, label boundaries, and local scales are derived using Bakry-Emery Γ-calculus. Under a balanced Gaussian class-conditional model with shared covariance, closed-form expressions are obtained for class affinities, leakage, and coarse spectra, parameterized by regularized Mahalanobis separations c_ε^{(a,b)}. The paper proves that these observables change smoothly under feature perturbations, in contrast to discontinuous changes in neighborhood-graph based diagnostics. Synthetic experiments confirm the closed-form results under the model, and MNIST experiments illustrate that the observables can track aspects of training, network width, and stability to perturbations.

Significance. Should the central derivations and the smoothness proof be correct, this work contributes a theoretically motivated smooth operator-based approach to representation geometry analysis, offering an alternative to discrete curvature methods on graphs. The closed-form results under the Gaussian model provide a direct link to Mahalanobis geometry, which could aid interpretability. The emphasis on smoothness and stability is a notable strength, and the combination of synthetic validation with real-data application is positive. This could be significant for the field of representation learning and geometric analysis of neural networks.

major comments (1)
  1. [Gaussian model derivation and MNIST experiments] The closed-form class affinities, leakage, and spectra are derived specifically under the balanced Gaussian class-conditional snapshot model with shared covariance (as stated in the abstract). However, the MNIST experiments section applies the general operator observables to learned representations from neural networks without providing any empirical checks (e.g., for joint Gaussianity, shared covariance, or calibration of the Mahalanobis separations c_ε^{(a,b)}) to confirm that the modeling assumptions hold for the real feature clouds. This is a load-bearing issue for the interpretability claims, as deviations from the model would invalidate the specific closed-form controls even if the diffusion operator remains applicable.
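As one illustration of the kind of check the comment asks for (the statistic is illustrative and not taken from the paper), a shared-covariance diagnostic could compare per-class covariances against the pooled covariance of the learned features:

```python
import numpy as np

def covariance_homogeneity(X, y):
    """Relative Frobenius spread of per-class covariances around the pooled covariance.

    Values near zero are consistent with the shared-covariance assumption; what
    counts as 'close enough' is a judgment call, not a threshold from the paper.
    """
    classes = np.unique(y)
    covs = [np.cov(X[y == c].T) for c in classes]
    pooled = np.mean(covs, axis=0)
    devs = [np.linalg.norm(C - pooled, ord='fro') for C in covs]
    return float(np.mean(devs) / np.linalg.norm(pooled, ord='fro'))
```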
minor comments (2)
  1. The abstract mentions 'coarse spectra' without a precise definition; this should be clarified in the main text near the relevant equations.
  2. [Notation] Ensure that the regularization parameter ε is consistently defined and its role in the Mahalanobis separation is explained early in the paper.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation of the manuscript's potential contributions and for the detailed comment on the scope of the Gaussian modeling assumptions. We address the concern point by point below, clarifying the separation between the model-specific closed forms and the general operator framework applied to MNIST.

read point-by-point responses
  1. Referee: The closed-form class affinities, leakage, and spectra are derived specifically under the balanced Gaussian class-conditional snapshot model with shared covariance (as stated in the abstract). However, the MNIST experiments section applies the general operator observables to learned representations from neural networks without providing any empirical checks (e.g., for joint Gaussianity, shared covariance, or calibration of the Mahalanobis separations c_ε^{(a,b)}) to confirm that the modeling assumptions hold for the real feature clouds. This is a load-bearing issue for the interpretability claims, as deviations from the model would invalidate the specific closed-form controls even if the diffusion operator remains applicable.

    Authors: We agree that the closed-form expressions for class affinities, leakage, and coarse spectra (Section 3) are derived specifically under the balanced Gaussian class-conditional snapshot model with shared covariance and are validated only in the synthetic experiments (Section 4.1). The MNIST experiments (Section 4.2) instead apply the general diffusion operator observables—transport maps, spectral quantities, label-boundary measures, and local scales—obtained via Bakry-Emery Γ-calculus from the Gaussian-kernel Markov operator on arbitrary feature clouds. These general observables are defined without reference to the Gaussian class-conditional assumption or the regularized Mahalanobis parameters c_ε^{(a,b)}. The MNIST results report how these observables evolve with training, vary with network width, and respond to perturbations; they do not invoke or rely on the closed-form controls. Consequently, any deviation from Gaussianity in the MNIST feature clouds does not affect the validity of the reported observations or the smoothness proof (which holds for the general operator). To prevent misreading, we will insert a short clarifying paragraph at the start of Section 4.2 explicitly distinguishing the two uses of the framework. We do not view empirical Gaussianity checks as necessary for the general-operator claims, though we acknowledge the referee's point that such checks could further strengthen interpretability if added in future work. revision: partial

Circularity Check

0 steps flagged

No circularity: closed forms are direct consequences of stated Gaussian model assumptions

full rationale

The paper explicitly derives closed-form class affinities, leakage, and spectra from the balanced Gaussian class-conditional snapshot model with shared covariance, expressing them in terms of regularized Mahalanobis separations c_ε^(a,b). This is a standard mathematical reduction under the model, not a fit to data followed by a renamed prediction. Synthetic experiments are used only to verify that the derived expressions match the assumed generative process. The smoothness result under perturbations is proved separately via Bakry-Emery calculus on the operator and does not rely on the Gaussian closed forms. Application to MNIST uses the general (non-closed-form) observables without asserting that real representations satisfy the modeling assumptions. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems from the same authors appear as load-bearing steps. The derivation chain is therefore self-contained against its explicit premises.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on modeling feature clouds as balanced Gaussian class-conditionals with shared covariance and on standard properties of diffusion operators and Bakry-Emery calculus.

free parameters (1)
  • regularization parameter ε
    Appears in the definition of regularized Mahalanobis separations c_ε that control the closed-form observables.
axioms (2)
  • domain assumption Feature snapshots are well-approximated by balanced Gaussian class-conditional distributions sharing a common covariance matrix.
    Invoked to obtain closed-form expressions for class affinities, leakage, and spectra.
  • standard math The Gaussian-kernel diffusion Markov operator and Bakry-Emery Γ-calculus apply directly to finite feature clouds.
    Background operator theory used to derive transport, spectral, and local-scale observables.

pith-pipeline@v0.9.0 · 5474 in / 1419 out tokens · 69522 ms · 2026-05-09T19:32:21.223351+00:00 · methodology

discussion (0)

