
arxiv: 2606.19268 · v2 · pith:BEM4L6EGnew · submitted 2026-06-17 · 🧮 math.ST · cs.CG · stat.TH

Patnaik-Pearson intrinsic dimension for internal representations of neural networks

Pith reviewed 2026-07-03 23:39 UTC · model grok-4.3

classification 🧮 math.ST · cs.CG · stat.TH
keywords intrinsic dimension · neural networks · transformers · power law · empirical spectral density · token embeddings · BERT

The pith

The Patnaik-Pearson dimension measures intrinsic dimension in neural network representations and aligns critical tail exponents with HTSR and SETOL when weight matrices follow Pareto spectra.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the Patnaik-Pearson dimension as an intrinsic dimension estimator for data manifolds, drawing on the TwoNN method and spectral analyses from HTSR and SETOL. It establishes properties of the estimator and demonstrates that, for weight matrices whose empirical spectral density follows a power-law distribution, the critical values of the tail exponent match those from the earlier HTSR and SETOL frameworks. The work then tracks how this dimension changes under standard neural network operations and applies the measure to token embeddings in BERT-base and DeepSeek-R1-Distill-Qwen-1.5B, observing its evolution across layers.

Core claim

For weight matrices whose Empirical Spectral Density follows a Pareto distribution, the Patnaik-Pearson dimension relates directly to HTSR and SETOL analysis such that the critical values of the tail exponent coincide between the two approaches.
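To make the Pareto setting concrete, here is a minimal Python sketch. It rests on an assumption that this page does not state: that the quantity ν is the classical two-moment (Patnaik-style) effective degrees of freedom, (Σλ)² / Σλ², of a set of eigenvalues λ. The function names `effective_dof`, `pareto_nu_over_d` and `moment_ratio_limit` are hypothetical, and the closed form in `moment_ratio_limit` is simply (E λ)² / E(λ²) for a unit-scale Pareto law, which may or may not be the paper's C(α).

```python
import random


def effective_dof(weights):
    """Two-moment effective degrees of freedom: (sum w)^2 / sum(w^2).

    ASSUMPTION: used here as a stand-in for the paper's nu; the page
    does not give the paper's exact definition.
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def pareto_nu_over_d(alpha, d, seed=0):
    """Draw d unit-scale Pareto values with P(X > x) = x**(-alpha), return nu/d."""
    rng = random.Random(seed)
    lams = [rng.paretovariate(alpha) for _ in range(d)]
    return effective_dof(lams) / d


def moment_ratio_limit(alpha):
    """(E X)^2 / E(X^2) for a unit-scale Pareto law; finite only for alpha > 2."""
    if alpha <= 2:
        raise ValueError("second moment is infinite for alpha <= 2")
    return alpha * (alpha - 2) / (alpha - 1) ** 2


if __name__ == "__main__":
    for alpha in (1.5, 2.5, 6.0):
        for d in (1_000, 100_000):
            print(f"alpha={alpha}, d={d}: nu/d = {pareto_nu_over_d(alpha, d):.4f}")
    print("moment ratio for alpha=6:", moment_ratio_limit(6.0))
```

Under this assumption ν/d can never exceed 1 (by the Cauchy-Schwarz inequality), and heavier tails pull it towards 0, which matches the qualitative picture in the figure captions on this page.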

What carries the argument

The Patnaik-Pearson dimension, an intrinsic dimension estimator constructed from the TwoNN method combined with Patnaik-Pearson statistics, applied to weight matrices and token embeddings treated as data manifolds.
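For readers unfamiliar with the TwoNN ingredient, the following is a small sketch of the maximum-likelihood form of that estimator: for each point take the ratio μ = r₂/r₁ of the distances to its second and first nearest neighbours, then estimate the dimension as N / Σ ln μ. This is only the TwoNN baseline, not the Patnaik-Pearson dimension, and it is a variant of the published procedure (Facco et al. fit a line to the empirical distribution of μ rather than using the closed-form MLE). The function name `twonn_dimension` and the brute-force neighbour search are illustrative choices.

```python
import math
import random


def twonn_dimension(points):
    """MLE variant of TwoNN: N / sum(log(r2 / r1)) over all points.

    Brute-force O(N^2) neighbour search, intended for small examples only.
    """
    log_ratios = []
    for i, p in enumerate(points):
        # Squared distances to the nearest and second-nearest neighbours.
        r1_sq = r2_sq = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            dist_sq = sum((a - b) ** 2 for a, b in zip(p, q))
            if dist_sq < r1_sq:
                r1_sq, r2_sq = dist_sq, r1_sq
            elif dist_sq < r2_sq:
                r2_sq = dist_sq
        if r1_sq > 0.0:  # skip exact duplicates, where the ratio is undefined
            # log(r2 / r1) = 0.5 * log(r2^2 / r1^2)
            log_ratios.append(0.5 * math.log(r2_sq / r1_sq))
    return len(log_ratios) / sum(log_ratios)


if __name__ == "__main__":
    rng = random.Random(0)
    # A flat 2-dimensional square embedded in 5-dimensional space.
    square = [(rng.random(), rng.random(), 0.0, 0.0, 0.0) for _ in range(300)]
    print("TwoNN estimate:", twonn_dimension(square))
```

Because only the two nearest neighbours of each point are used, the estimate reflects local geometry, which is the contrast the Figure 1 caption draws with the Patnaik-Pearson dimension.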

If this is right

  • The dimension of the initial token embedding manifold can be computed directly for transformer models.
  • The Patnaik-Pearson dimension evolves in measurable ways as embeddings pass through successive layers.
  • The coincidence of critical tail exponents holds specifically under the Pareto assumption for spectral densities.
  • Numerical evaluation on real models like BERT-base confirms the dimension can be tracked layer by layer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the dimension stabilizes or changes predictably across layers, it may indicate fixed points in how networks compress or expand manifolds.
  • The approach could extend to other architectures by treating attention weights or activations as additional manifolds.
  • Discrepancies in non-Pareto cases might highlight when spectral methods lose their direct connection to intrinsic dimension estimates.

Load-bearing premise

The empirical spectral density of the weight matrices follows a Pareto power-law distribution.
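One standard way to probe a power-law premise of this kind on a list of eigenvalues is the Hill estimator of the tail exponent. The sketch below is a generic illustration, not the fitting procedure of the paper, HTSR or WeightWatcher, none of which are described on this page. The name `hill_tail_exponent` and the cut-off `k` are hypothetical choices, and the exponent is in the survival-function convention P(X > x) ~ x^(-α); an exponent quoted for the density would be larger by one.

```python
import math
import random


def hill_tail_exponent(values, k):
    """Hill estimate of alpha in P(X > x) ~ x**(-alpha) from the k largest values.

    All values must be positive. The choice of k is a judgment call:
    too small is noisy, too large reaches outside the tail.
    """
    ordered = sorted(values, reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < len(values)")
    threshold = ordered[k]  # the (k+1)-th largest value
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess


if __name__ == "__main__":
    rng = random.Random(42)
    sample = [rng.paretovariate(2.5) for _ in range(20_000)]
    for k in (200, 2_000):
        print(f"k={k}: alpha estimate = {hill_tail_exponent(sample, k):.3f}")
```

If the estimate drifts strongly as `k` changes, that is a hint that the tail is not well described by a single power law, which is exactly the situation in which the load-bearing premise would fail.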

What would settle it

Empirical computation on a weight matrix whose spectral density deviates from Pareto form, showing that the Patnaik-Pearson critical tail exponent no longer matches the HTSR or SETOL value.

Figures

Figures reproduced from arXiv: 2606.19268 by Tom Hadfield.

Figure 1
Figure 1: 1000 points in $\mathbb{R}^2$ with Patnaik-Pearson dimension 1.514, TwoNN dimension 1.942. This suggests that the Patnaik-Pearson dimension may be thought of as a “global” measure of dimension, whereas the TwoNN dimension captures local dimensionality. [Plot residue: title "Patnaik-Pearson and TwoNN for a solid ball of varying dimension embedded in 1000-dimensio…"; x-axis: actual dimension; y-axis: intrinsic dimension.]
Figure 2
Figure 2: Patnaik-Pearson and TwoNN dimension estimates for a solid ball of dimension varying between …
Figure 3
Figure 3: Uniform distribution: numerical tests of (29), (a) for 10 …
Figure 3
Figure 3: Uniform distribution: numerical tests of (31), (a) for 10 …
Figure 4
Figure 4: Marchenko-Pastur: numerical tests of (30), …
Figure 4
Figure 4: Marchenko-Pastur: numerical tests of (32), …
Figure 5
Figure 5: (a) $\nu(\alpha,d)/d$ as $\alpha$ varies, for $d$ increasing from 100 to 1,000,000, together with $\nu_\infty(\alpha)$. (b) Difference of $\nu(\alpha,d)/d$ from the conjectured limit as $\alpha$ varies, for $d$ increasing from 100 to 1,000,000.
Adjacent body text: … and the result follows. For $2 > \alpha > 1$, $E(Y)$ is finite, but $E(Y^2)$ is infinite. By Kolmogorov’s SLLN [8], $\frac{1}{d}\sum_{i=1}^{d}\lambda_i \xrightarrow{\text{a.s.}} E(\lambda) < \infty$ and $\frac{1}{d}\sum_{i=1}^{d}\lambda_i^2 \xrightarrow{\text{a.s.}} \infty$. Hence $\frac{1}{d}\nu(\alpha,d) \to 0$ as $d \to \infty$. For $1 > \alpha$, then both…
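The step from the two almost-sure limits to the conclusion in the Figure 5 text can be spelled out as follows. This is a sketch under an assumption not stated on this page: that $\nu(\alpha,d)$ is the two-moment effective degrees of freedom of the eigenvalues $\lambda_1,\dots,\lambda_d$.

```latex
% ASSUMPTION: \nu(\alpha,d) = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2
\frac{1}{d}\,\nu(\alpha,d)
  = \frac{1}{d}\cdot\frac{\left(\sum_{i=1}^{d}\lambda_i\right)^{2}}{\sum_{i=1}^{d}\lambda_i^{2}}
  = \frac{\left(\frac{1}{d}\sum_{i=1}^{d}\lambda_i\right)^{2}}{\frac{1}{d}\sum_{i=1}^{d}\lambda_i^{2}}
% For 2 > \alpha > 1 the numerator tends to E(\lambda)^2 < \infty almost surely,
% while the denominator tends to \infty, so the ratio tends to 0.
% For \alpha > 2 both moments are finite and the ratio tends to
% E(\lambda)^2 / E(\lambda^2).
```

Written this way, the behaviour of $\nu/d$ is governed entirely by which of the first two moments of the eigenvalue distribution are finite.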
Figure 6
Figure 6: Numerical tests of (38, 41), for α = 1.5 and α = 2.5 and increasing d.
Adjacent body text: So for $\alpha \neq 1, 2$, then $\sum_{k=1}^{d}\lambda_k \approx$ …
Figure 6
Figure 6: Numerical tests of (40, 43), for α = 1.5 and α = 2.5 and increasing d.
Adjacent body text: So, for $\alpha > 2$, $\lim_{d \to \infty} \frac{1}{d}\nu(\alpha,d) = C(\alpha)$, and, further, for large $d$, $\ln\left(\frac{1}{d}\nu(\alpha,d) - C(\alpha)\right) = \ln(C(\alpha)) -$ …
Figure 7
Figure 7: Numerical tests of (39) and (42), for α = 1.0 and α = 2.0, for increasing d.
Adjacent table (tail exponent α: $\nu(\alpha,d)$ for large $d$; $\frac{1}{d}\nu(\alpha,d)$ for large $d$):
  • $\alpha > 2$: $C(\alpha)\,d$; $C(\alpha)$
  • $\alpha = 2$: $\frac{4d}{\ln(d)}$; $\frac{4}{\ln(d)}$
  • $2 > \alpha > 1$: $-C(\alpha)\,d^{2\left(\frac{\alpha-1}{\alpha}\right)}$; $-C(\alpha)\,d^{-\left(\frac{2-\alpha}{\alpha}\right)}$
  • $\alpha = 1$: $(\ln(d))^2$; $\frac{1}{d}(\ln(d))^2$
Figure 8
Figure 8: Product hypotheses (50). [Plot residue, panel 1: title "Pareto: nu/d (X^T X) compared to nu/d (X^T) * nu/d (X) as nu/d (X) varies; N, d randomly chosen between 500 and 1000, with N > d"; x-axis: nu/d (X); y-axis: nu/d. Panel 2: title "Pareto: actual and estimated nu/d (softm…"; x-axis: nu/d (W); y-axis: nu/d (softmax(W)).]
Figure 8
Figure 8: Product hypotheses (52).
Figure 9
Figure 9: (a) Product hypothesis (50) is not satisfied by …
Figure 9
Figure 9: (a) Product hypothesis (52) is not satisfied by …
Figure 10
Figure 10: (a) Patnaik-Pearson dimension for $X W X^T$ using (50). (b) Patnaik-Pearson dimension of Attention(Q, K, V), using (52, 53).
Adjacent body text: 4.6 Activation functions: ReLU. We investigate the effect of the ReLU activation function. For $X$ an $N \times d$ data manifold, define $(\mathrm{ReLU}(X))_{ij} = \begin{cases} X_{ij} & : X_{ij} \geq 0 \\ 0 & : X_{ij} < 0 \end{cases}$ (56)
Figure 10
Figure 10: (a) Patnaik-Pearson dimension for $X W X^T$ using (52). (b) Patnaik-Pearson dimension of Attention(Q, K, V), using (54, 55). [Plot residue, panel 1: title "nu/d ReLU(X) vs nu/d X, N = 1000, d = 1000"; x-axis: nu/d X; y-axis: nu/d ReLU(X); legend: nu/d ReLU(X), nu/d X, approx nu/d ReLU(X). Panel 2: title "alpha(ReLU(X)) vs alpha(X), N = 1000, d = 1000"; x-axis: alpha(X); y-axis: alpha(ReLU(X)); legend: alpha(ReLU(X)), alpha(X), approx alpha(ReLU(X…]
Figure 11
Figure 11: The effect of ReLU on Patnaik-Pearson dimension. Approximation given by (57).
Figure 12
Figure 12: Sum of two matrices $X_1 + X_2$: heavy tails dominate.
Adjacent body text: 4.7 Addition, Interpolation and Concatenation. For $X_1$ and $X_2$ both $N \times d$, define $X_1 + X_2$ as the usual matrix addition. As shown in …
Figure 13
Figure 13: Interpolation between two data manifolds
Figure 14
Figure 14: Concatenation: Patnaik-Pearson dimension of …
Figure 15
Figure 15: Normalisation: Patnaik-Pearson dimension for …
Figure 16
Figure 16: Patnaik-Pearson dimension and nu/d for sampled BERT token embeddings
Figure 17
Figure 17: Layerwise evolution of Patnaik-Pearson dimension for BERT.
Figure 18
Figure 18: Deepseek token embeddings: Patnaik-Pearson dimension and nu/d for samples of token …
Figure 19
Figure 19: Layerwise evolution of the Patnaik-Pearson dimension of Deepseek embeddings.
Original abstract

We define a new measure of intrinsic dimension of a data manifold, which we call the Patnaik-Pearson dimension, and apply this to internal representations of neural networks, in particular transformers. The inspiration for this comes from the HTSR and SETOL work of Martin, Mahoney and Hinrichs, combined with the TwoNN intrinsic dimension estimator of Facco et al. We prove various properties of this intrinsic dimension estimator. Treating weight matrices of neural networks as data manifolds, for weight matrices whose Empirical Spectral Density follows a Pareto (Power Law) distribution, we relate the Patnaik-Pearson dimension to the HTSR and SETOL analysis, and show that critical values of the tail exponent coincide for the two approaches. Using a combination of theoretical and numerical techniques, we study the behaviour of the Patnaik-Pearson dimension of a data manifold under the transformations typical of neural networks. We apply this machinery to the BERT-base and DeepSeek-R1-Distill-Qwen-1.5B models, to investigate first the Patnaik-Pearson dimension of the initial data manifold of token embeddings, and second the evolution of the Patnaik-Pearson dimension as token embeddings pass through the layers of the model. Code and notebooks used for the numerical results presented here are available at https://github.com/tdhadfield/PatnaikPearson.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript defines the Patnaik-Pearson intrinsic dimension estimator for data manifolds, drawing from HTSR/SETOL and the TwoNN method. It proves properties of the estimator, relates the dimension to HTSR/SETOL analysis specifically for weight matrices whose empirical spectral density follows a Pareto distribution (showing coincidence of critical tail exponents), examines the estimator's behavior under typical neural network transformations, and applies it to study token embeddings and their evolution through layers in the BERT-base and DeepSeek-R1-Distill-Qwen-1.5B models. Code is provided for the numerical results.

Significance. If the conditional relation holds and the applications are robust, the work offers a new intrinsic dimension tool that connects to heavy-tailed spectral regularization frameworks, potentially aiding analysis of representation geometry in transformers. The explicit conditioning on the Pareto ESD and the availability of reproducible code are strengths.

minor comments (3)
  1. The abstract states that the relation to HTSR/SETOL holds for weight matrices with Pareto ESD, but the applications focus on token embeddings; clarify in the introduction or §3 whether the Pareto condition is checked or relevant for the BERT/DeepSeek experiments.
  2. The manuscript mentions proving 'various properties' of the estimator; ensure the main text explicitly lists these properties with references to the relevant theorems or propositions.
  3. Minor notation inconsistencies may exist between the Patnaik-Pearson definition and the TwoNN baseline; verify consistency in the methods section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive assessment of the manuscript, including the recommendation for minor revision. No specific major comments are provided in the report, so there are no individual points to address point-by-point. We are happy to make any minor changes the editor may request based on the overall summary.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper defines the Patnaik-Pearson dimension by combining the TwoNN estimator with concepts from HTSR/SETOL, proves independent properties of the new estimator, and derives the relation between critical tail exponents only conditionally on the explicit assumption that ESD follows a Pareto distribution. No step equates the new dimension to a fitted input or prior result by construction, and no self-citation chain is load-bearing for the central claims. The applications to BERT and DeepSeek are presented as separate empirical uses of the defined machinery.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents exhaustive enumeration; the ledger is therefore provisional. The new dimension is constructed from the TwoNN estimator and the Pareto assumption on spectral density; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5772 in / 1123 out tokens · 18102 ms · 2026-07-03T23:39:29.548119+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 27 canonical work pages · 6 internal anchors

  1. [1]

    Alessio Ansuini, Alessandro Laio, Jakob H. Macke, Davide Zoccolan, Intrinsic dimension of data representations in deep neural networks. https://arxiv.org/abs/1905.12784

  2. [2]

    Hari Bercovici, Vittorino Pata, Philippe Biane, Stable Laws and Domains of Attraction in Free Probability Theory. Annals of Mathematics, 1999-05, Vol. 149 (3), p. 1023-1060

  3. [3]

    Nicolas Boullé, Alex Townsend, A Mathematical Guide to Operator Learning. https://arxiv.org/abs/2312.14688

  4. [4]

    P. Cizeau, J. P. Bouchaud, Theory of Levy matrices. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 1994 Sep, 50(3):1810-1822

  5. [5]

    DeepSeek-AI, DeepSeek-R1: Incentivising Reasoning Capability in LLMs via Reinforcement Learning. https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

  6. [6]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, Minneapolis, 2-7 June 2019, 4171-4186. https://arxiv.org/abs/1810.04805

  7. [7]

    Riccardo Di Sipio, Jairo Diaz-Rodriguez, Luis Serrano, The Curved Spacetime of Transformer Architectures. https://arxiv.org/abs/2511.03060

  8. [8]

    Paul Embrechts, Claudia Klüppelberg, Thomas Mikosch, Modelling Extremal Events: for Insurance and Finance. Springer Berlin Heidelberg (2003)

  9. [9]

    Facco, E., d’Errico, M., Rodriguez, A. et al., Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Sci Rep 7, 12140 (2017). https://doi.org/10.1038/s41598-017-11873-y https://www.nature.com/articles/s41598-017-11873-y

  10. [10]

    Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod, The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology. https://arxiv.org/abs/2505.20435

  11. [11]

    Charles Fefferman, Sanjoy Mitter, Hariharan Narayanan, Testing the Manifold Hypothesis. https://arxiv.org/abs/1310.0425

  12. [12]

    Stephen Fitz, Peter Romero, Jiyan Jonas Schneider, Hidden Holes: Topological Aspects of Language Models. https://arxiv.org/abs/2406.05798

  13. [13]

    Sergey Foss, Dmitry Korshunov, Stan Zachary, An Introduction to Heavy-Tailed and Subexponential Distributions. Springer (2012)

  14. [14]

    In: International Conference on Learning Representations (ICLR)

    Yuri Gardinazzi, Giada Panerai, Karthik Viswanathan, Alessio Ansuini, Alberto Cazzaniga, Matteo Biagetti, Persistent Topological Features in Large Language Models. https://arxiv.org/abs/2410.11042

  15. [15]

    Kaie Kubjas, Jiayi Li, Maximilian Wiesmann, Geometry of Polynomial Neural Networks. https://arxiv.org/abs/2402.00949

  16. [16]

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet, A mathematical perspective on Transformers. https://arxiv.org/abs/2312.10794

  17. [17]

    Alexandros Grosdos, Elina Robeva, Maksym Zubkov, Algebraic geometry of rational neural networks. https://arxiv.org/abs/2509.11088

  18. [18]

    Frank E. Grubbs, Helen J. Coon, E. S. Pearson, On the Use of Patnaik Type Chi Approximations to the Range in Significance Tests. Biometrika, Vol. 53, No. 1/2 (Jun., 1966), pp. 248-252. https://doi.org/10.2307/2334073 https://www.jstor.org/stable/2334073

  19. [19]

    German Magai, Anton Ayzenberg, Topology and Geometry of Data Manifold in Deep Learning. https://arxiv.org/abs/2204.08624

  20. [20]

    Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathlén Kohn, An Invitation to Neuroalgebraic Geometry. https://arxiv.org/abs/2501.18915v1

  21. [21]

    Charles H. Martin, WeightWatcher. https://github.com/CalculatedContent/WeightWatcher

  22. [22]

    Charles H. Martin, A Spectral Renormalization-Group View of Learning. https://www.linkedin.com/posts/charlesmartin14_a-spectral-renormalization-group-view-of-ugcPost-7471078735861944321-jXs

  23. [23]

    Charles H. Martin, Christopher Hinrichs, SETOL: A Semi-Empirical Theory of (Deep) Learning. https://arxiv.org/abs/2507.17912

  24. [24]

    Charles H. Martin, Michael W. Mahoney, Traditional and Heavy-Tailed Self Regularization in Neural Network Models. https://arxiv.org/abs/1901.08276

  25. [25]

    Govind Menon, The geometry of the deep linear network. https://arxiv.org/abs/2411.09004

  26. [26]

    Thomas Mikosch, Olivier Wintenberger, Extreme Value Theory for Time Series Models with Power-Law Tails. Springer, 2024

  27. [27]

    Miquel Noguer I Alonso, The Complete Mathematics of Transformers: A Rigorous Treatment with Full Derivations, Proofs, and Theoretical Foundations. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6073468

  28. [28]

    Eng-Jon Ong, Omer Bobrowski, Gesine Reinert, Primoz Skraba, A Universal Nearest-Neighbor Estimator for Intrinsic Dimensionality. https://arxiv.org/abs/2603.10493v1

  29. [29]

    Patnaik, P. B., The non-central χ²- and F-distributions and their application. Biometrika, 36(1/2) (1949), 202-232. https://doi.org/10.2307/2332542

  30. [30]

    Pearson, E. S., Note on an approximation to the distribution of noncentral χ². Biometrika, 46(3/4), (1959) 364. https://doi.org/10.2307/2333533

  31. [31]

    Marc Potters, Jean-Philippe Bouchaud, A first course in Random Matrix Theory: for physicists, engineers and data scientists. Cambridge University Press (2021)

  32. [32]

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra, Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. https://arxiv.org/abs/2201.02177

  33. [33]

    Hari K. Prakash, Charles H. Martin, Grokking and Generalization Collapse: Insights from HTSR theory. https://arxiv.org/abs/2506.04434

  34. [34]

    Philippe Rigollet, The Mean-Field Dynamics of Transformers. https://arxiv.org/abs/2512.01868

  35. [35]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention Is All You Need. https://arxiv.org/abs/1706.03762

  36. [36]

    D. Voiculescu, K. Dykema, A. Nica, Free random variables. CRM Monograph Series, No. 1, A.M.S., Providence, RI, 1992

  37. [37]

    James Vuckovic, Aristide Baratin, Remi Tachet des Combes, A Mathematical Theory of Attention. https://arxiv.org/abs/2007.02876

  38. [38]

    Nick Whiteley, Annie Gray, Patrick Rubin-Delanchy, Statistical exploration of the Manifold Hypothesis. https://arxiv.org/abs/2208.11665v3

8 Appendix: The TwoNN intrinsic dimension formula

We derive the formulae given in [9] which we use in Section 2.3, in particular (9). Suppose we have a region in $\mathbb{R}^d$, containing random points with (uniform) density $\rho$. Th...