Patnaik-Pearson intrinsic dimension for internal representations of neural networks

Tom Hadfield

arxiv: 2606.19268 · v2 · pith:BEM4L6EGnew · submitted 2026-06-17 · 🧮 math.ST · cs.CG · stat.TH

Pith reviewed 2026-07-03 23:39 UTC · model grok-4.3

keywords: intrinsic dimension, neural networks, transformers, power law, empirical spectral density, token embeddings, BERT

The pith

The Patnaik-Pearson dimension measures intrinsic dimension in neural network representations and aligns critical tail exponents with HTSR and SETOL when weight matrices follow Pareto spectra.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the Patnaik-Pearson dimension as an intrinsic dimension estimator for data manifolds, drawing on the TwoNN method and spectral analyses from HTSR and SETOL. It establishes properties of the estimator and demonstrates that, for weight matrices whose empirical spectral density follows a power-law distribution, the critical values of the tail exponent match those from the earlier HTSR and SETOL frameworks. The work then tracks how this dimension changes under standard neural network operations and applies the measure to token embeddings in BERT-base and DeepSeek-R1-Distill-Qwen-1.5B, observing its evolution across layers.

Core claim

For weight matrices whose Empirical Spectral Density follows a Pareto distribution, the Patnaik-Pearson dimension relates directly to the HTSR and SETOL analysis, such that the critical values of the tail exponent coincide between the Patnaik-Pearson approach and the HTSR/SETOL approach.
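As a rough illustration of the objects named in this claim, the following Python sketch computes the eigenvalues that make up an empirical spectral density and estimates a Pareto tail exponent from a sample. The normalisation $W^T W / N$, the function names, and the choice of the Hill estimator are assumptions made for illustration; this page does not say how the paper fits the tail exponent.

```python
import numpy as np


def esd_eigenvalues(W: np.ndarray) -> np.ndarray:
    """Eigenvalues of W^T W / N for an N x d matrix W (assumed normalisation).

    Returns min(N, d) values, computed from the singular values of W.
    """
    n_rows = W.shape[0]
    singular_values = np.linalg.svd(W, compute_uv=False)
    return singular_values**2 / n_rows


def hill_tail_exponent(values: np.ndarray, k: int) -> float:
    """Hill estimate of alpha in P(value > x) ~ x^(-alpha), from the k largest values."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    # Log-excesses of the top k values over the (k+1)-th largest value.
    log_excess = np.log(ordered[:k] / ordered[k])
    return k / float(np.sum(log_excess))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic "spectrum": classical Pareto with x_min = 1 and alpha = 3.
    synthetic = 1.0 + rng.pareto(3.0, size=50_000)
    print("Hill estimate:", hill_tail_exponent(synthetic, k=2_000))
    # ESD eigenvalues of a Gaussian matrix, where no heavy tail is expected.
    W = rng.standard_normal((400, 100))
    print("largest ESD eigenvalue:", esd_eigenvalues(W).max())
```

The Hill estimator targets the exponent of the survival function; for a Pareto law the exponent of the density is larger by one, so any comparison of critical values should fix one convention first.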

What carries the argument

The Patnaik-Pearson dimension, an intrinsic dimension estimator constructed from the TwoNN method combined with Patnaik-Pearson statistics, applied to weight matrices and token embeddings treated as data manifolds.
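For orientation, here is a minimal sketch of the TwoNN ingredient only. It uses the closed-form maximum-likelihood variant, $\hat{d} = N / \sum_i \ln(r_{2,i}/r_{1,i})$, where $r_{1,i}$ and $r_{2,i}$ are the distances from point $i$ to its first and second nearest neighbours; the original TwoNN paper [9] instead fits a line to the empirical distribution of these ratios. The function name and the brute-force distance computation are illustrative assumptions, and the Patnaik-Pearson dimension itself is not implemented here because its definition is not reproduced on this page.

```python
import numpy as np


def twonn_dimension(points: np.ndarray) -> float:
    """Maximum-likelihood TwoNN estimate for an (N, D) array of distinct points."""
    points = np.asarray(points, dtype=float)
    # Squared pairwise distances via the Gram matrix (adequate for a few thousand points).
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (points @ points.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip small negative round-off
    np.fill_diagonal(sq_dists, np.inf)    # a point is not its own neighbour
    # After partitioning, columns 0 and 1 hold the smallest and second-smallest entries.
    two_smallest = np.partition(sq_dists, 1, axis=1)[:, :2]
    r1 = np.sqrt(two_smallest[:, 0])
    r2 = np.sqrt(two_smallest[:, 1])
    mu = r2 / r1  # assumes no duplicate points, so r1 > 0
    return len(points) / float(np.sum(np.log(mu)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A 3-dimensional cube embedded in 10 ambient dimensions.
    embedded = np.zeros((2000, 10))
    embedded[:, :3] = rng.random((2000, 3))
    print("TwoNN estimate:", twonn_dimension(embedded))
```

Because the estimate depends only on the two nearest neighbours of each point, it is a local measure, which matches the contrast drawn in the Figure 1 caption.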

If this is right

The dimension of the initial token embedding manifold can be computed directly for transformer models.
The Patnaik-Pearson dimension evolves in measurable ways as embeddings pass through successive layers.
The coincidence of critical tail exponents holds specifically under the Pareto assumption for spectral densities.
Numerical evaluation on real models like BERT-base confirms the dimension can be tracked layer by layer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the dimension stabilizes or changes predictably across layers, it may indicate fixed points in how networks compress or expand manifolds.
The approach could extend to other architectures by treating attention weights or activations as additional manifolds.
Discrepancies in non-Pareto cases might highlight when spectral methods lose their direct connection to intrinsic dimension estimates.

Load-bearing premise

The empirical spectral density of the weight matrices follows a Pareto power-law distribution.

What would settle it

Empirical computation on a weight matrix whose spectral density deviates from Pareto form, showing that the Patnaik-Pearson critical tail exponent no longer matches the HTSR or SETOL value.

Figures

Figures reproduced from arXiv: 2606.19268 by Tom Hadfield.

**Figure 1.** 1000 points in $\mathbb{R}^2$ with Patnaik-Pearson dimension 1.514 and TwoNN dimension 1.942. This suggests that the Patnaik-Pearson dimension may be thought of as a "global" measure of dimension, whereas the TwoNN dimension captures local dimensionality. [Plot residue from the following figure: x-axis "actual dimension", y-axis "intrinsic dimension", title "Patnaik-Pearson and TwoNN for a solid ball of varying dimension embedded in 1000-dimensio…"]

**Figure 2.** Patnaik-Pearson and TwoNN dimension estimates for a solid ball of dimension varying between …

**Figure 3.** Uniform distribution: numerical tests of (29), (a) for 10 …

**Figure 3.** Uniform distribution: numerical tests of (31), (a) for 10 …

**Figure 4.** Marchenko-Pastur: numerical tests of (30), …

**Figure 4.** Marchenko-Pastur: numerical tests of (32), …

**Figure 5.** (a) $\nu(\alpha,d)/d$ as $\alpha$ varies, for $d$ increasing from 100 to 1,000,000, together with $\nu_\infty(\alpha)$. (b) Difference of $\nu(\alpha,d)/d$ from the conjectured limit as $\alpha$ varies, for $d$ increasing from 100 to 1,000,000. Text following the figure: … and the result follows. For $2 > \alpha > 1$, $E(Y)$ is finite but $E(Y^2)$ is infinite. By Kolmogorov's SLLN [8], $\frac{1}{d}\sum_{i=1}^{d}\lambda_i \xrightarrow{\text{a.s.}} E(\lambda) < \infty$ and $\frac{1}{d}\sum_{i=1}^{d}\lambda_i^2 \xrightarrow{\text{a.s.}} \infty$. Hence $\frac{1}{d}\nu(\alpha,d) \to 0$ as $d \to \infty$. For $1 > \alpha$, then both…

**Figure 6.** Numerical tests of (38, 41), for $\alpha = 1.5$ and $\alpha = 2.5$ and increasing $d$. Text following the figure: So for $\alpha \neq 1, 2$, $\sum_{k=1}^{d}\lambda_k \approx$ d s 1 − s …

**Figure 6.** Numerical tests of (40, 43), for $\alpha = 1.5$ and $\alpha = 2.5$ and increasing $d$. Text following the figure: So, for $\alpha > 2$, $\lim_{d \to \infty} \frac{1}{d}\nu(\alpha,d) = C(\alpha)$, and, further, for large $d$, $\ln\left(\frac{1}{d}\nu(\alpha,d) - C(\alpha)\right) = \ln(C(\alpha)) - $ …

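**Figure 7.** Numerical tests of (39) and (42), for $\alpha = 1.0$ and $\alpha = 2.0$, for increasing $d$. Table following the figure:

| Tail exponent $\alpha$ | $\nu(\alpha, d)$ for large $d$ | $\frac{1}{d}\nu(\alpha, d)$ for large $d$ |
|---|---|---|
| $\alpha > 2$ | $C(\alpha)\,d$ | $C(\alpha)$ |
| $\alpha = 2$ | $\frac{4d}{\ln(d)}$ | $\frac{4}{\ln(d)}$ |
| $2 > \alpha > 1$ | $-C(\alpha)\,d^{2\frac{\alpha-1}{\alpha}}$ | $-C(\alpha)\,d^{-\frac{2-\alpha}{\alpha}}$ |
| $\alpha = 1$ | $(\ln(d))^2$ | $\frac{1}{d}(\ln(d))^2$ |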
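The large-$d$ regimes listed in the Figure 7 caption can be probed with a short simulation. This page does not reproduce the paper's definition of $\nu(\alpha, d)$, so the sketch below assumes the moment-matching form $\nu = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, which is consistent with the law-of-large-numbers argument quoted in the Figure 5 caption but is an assumption, not the author's stated formula. Under that assumption and a classical Pareto law with $x_{\min} = 1$, the limit for $\alpha > 2$ would be $E(\lambda)^2 / E(\lambda^2) = \alpha(\alpha-2)/(\alpha-1)^2$; the helper names are hypothetical.

```python
import numpy as np


def nu_over_d(lam: np.ndarray) -> float:
    """(sum lam)^2 / (d * sum lam^2): ASSUMED moment-matching form of nu(alpha, d) / d."""
    lam = np.asarray(lam, dtype=float)
    return float(lam.sum() ** 2 / (lam.size * np.sum(lam**2)))


def assumed_limit(alpha: float) -> float:
    """E(lam)^2 / E(lam^2) for a Pareto law with x_min = 1; finite moments need alpha > 2."""
    return alpha * (alpha - 2.0) / (alpha - 1.0) ** 2


def pareto_sample(alpha: float, d: int, rng: np.random.Generator) -> np.ndarray:
    # numpy's pareto() draws from the Lomax law; adding 1 gives the classical Pareto.
    return 1.0 + rng.pareto(alpha, size=d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 200_000
    # alpha > 2: nu/d should settle near a constant.
    print("alpha=6.0:", nu_over_d(pareto_sample(6.0, d, rng)), "vs", assumed_limit(6.0))
    # 2 > alpha > 1: the second moment diverges, so nu/d should drift towards 0.
    print("alpha=1.5:", nu_over_d(pareto_sample(1.5, d, rng)))
```

By the Cauchy-Schwarz inequality this assumed ratio never exceeds 1, with equality only when all $\lambda_i$ are equal, so it behaves like a normalised effective dimension.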

**Figure 8.** Product hypotheses (50). [Plot residue: panel "Pareto: nu/d (X^T X) compared to nu/d (X^T) * nu/d (X) as nu/d (X) varies; N, d randomly chosen between 500 and 1000, with N > d", x-axis nu/d (X), y-axis nu/d; panel "Pareto: actual and estimated nu/d (softm…", x-axis nu/d (W), y-axis nu/d (softmax(W))]

**Figure 8.** Product hypotheses (52). [Plot residue: panel "Pareto: nu/d (X^T X) compared to nu/d (X^T) * nu/d (X) as nu/d (X) varies; N, d randomly chosen between 500 and 1000, with N > d", x-axis nu/d (X), y-axis nu/d; panel "Pareto: actual and estimated nu/d (softm…", x-axis nu/d (W), y-axis nu/d (softmax(W))]

**Figure 9.** (a) Product hypothesis (50) is not satisfied by …

**Figure 9.** (a) Product hypothesis (52) is not satisfied by …

**Figure 10.** (a) Patnaik-Pearson dimension for $X W X^T$ using (50). (b) Patnaik-Pearson dimension of $\mathrm{Attention}(Q, K, V)$, using (52, 53). Text following the figure: 4.6 Activation functions: ReLU. We investigate the effect of the ReLU activation function. For $X$ an $N \times d$ data manifold, define $(\mathrm{ReLU}(X))_{ij} = \begin{cases} X_{ij} & X_{ij} \ge 0 \\ 0 & X_{ij} < 0 \end{cases}$ (56)

**Figure 10.** (a) Patnaik-Pearson dimension for $X W X^T$ using (52). (b) Patnaik-Pearson dimension of $\mathrm{Attention}(Q, K, V)$, using (54, 55). [Plot residue from the following figure: panel "nu/d ReLU(X) vs nu/d X, N = 1000, d = 1000" with legend nu/d ReLU(X), nu/d X, approx nu/d ReLU(X); panel "alpha(ReLU(X)) vs alpha(X), N = 1000, d = 1000" with legend alpha(ReLU(X)), alpha(X), approx alpha(ReLU(X…]

**Figure 11.** The effect of ReLU on Patnaik-Pearson dimension. Approximation given by (57).

**Figure 12.** Sum of two matrices $X_1 + X_2$: heavy tails dominate. Text following the figure: 4.7 Addition, Interpolation and Concatenation. For $X_1$ and $X_2$ both $N \times d$, define $X_1 + X_2$ as the usual matrix addition. As shown in …

**Figure 13.** Interpolation between two data manifolds.

**Figure 14.** Concatenation: Patnaik-Pearson dimension of …

**Figure 15.** Normalisation: Patnaik-Pearson dimension for …

**Figure 16.** Patnaik-Pearson dimension and nu/d for sampled BERT token embeddings.

**Figure 17.** Layerwise evolution of Patnaik-Pearson dimension for BERT.

**Figure 18.** DeepSeek token embeddings: Patnaik-Pearson dimension and nu/d for samples of token …

**Figure 19.** Layerwise evolution of the Patnaik-Pearson dimension of DeepSeek embeddings.

Original abstract

We define a new measure of intrinsic dimension of a data manifold, which we call the Patnaik-Pearson dimension, and apply this to internal representations of neural networks, in particular transformers. The inspiration for this comes from the HTSR and SETOL work of Martin, Mahoney and Hinrichs, combined with the TwoNN intrinsic dimension estimator of Facco et al. We prove various properties of this intrinsic dimension estimator. Treating weight matrices of neural networks as data manifolds, for weight matrices whose Empirical Spectral Density follows a Pareto (Power Law) distribution, we relate the Patnaik-Pearson dimension to the HTSR and SETOL analysis, and show that critical values of the tail exponent coincide for the two approaches. Using a combination of theoretical and numerical techniques, we study the behaviour of the Patnaik-Pearson dimension of a data manifold under the transformations typical to neural networks. We apply this machinery to the BERT-base and DeepSeek-R1-Distill-Qwen-1.5B models, to investigate first the Patnaik-Pearson dimension of the initial data manifold of token embeddings, and second the evolution of the Patnaik-Pearson dimension as token embeddings pass through the layers of the model. Code and notebooks used for the numerical results presented here are available at https://github.com/tdhadfield/PatnaikPearson

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines a Patnaik-Pearson intrinsic dimension by blending TwoNN with HTSR/SETOL ideas and shows matching critical tail exponents only when weight-matrix spectra are Pareto.

This paper's main contribution is a new intrinsic dimension measure for neural network internals that combines the TwoNN estimator with concepts from HTSR and SETOL. It shows that under a Pareto assumption on the spectral density, the critical tail exponents line up between the two approaches.

The author defines the Patnaik-Pearson dimension, proves some of its properties, and examines its behavior under typical neural net transformations like those in transformers. The work includes numerical experiments on BERT-base and DeepSeek-R1-Distill-Qwen-1.5B, tracking how the dimension evolves from token embeddings through the layers. Having the code available helps.

The relation to prior work is the strongest part when the power-law condition holds, but that's a real limitation because not all weight matrices will satisfy it. The paper states the condition, so it's not hidden, but the practical value depends on how often it applies. Without seeing the full derivations or error analysis, it's hard to judge the robustness of the numerical results.

This is for researchers interested in geometric properties of representations in deep learning. It adds a specific tool rather than solving bigger questions.

I would send it for peer review because it has a clear new definition, some theory, and real model applications with code.

Referee Report

0 major / 3 minor

Summary. The manuscript defines the Patnaik-Pearson intrinsic dimension estimator for data manifolds, drawing from HTSR/SETOL and the TwoNN method. It proves properties of the estimator, relates the dimension to HTSR/SETOL analysis specifically for weight matrices whose empirical spectral density follows a Pareto distribution (showing coincidence of critical tail exponents), examines the estimator's behavior under typical neural network transformations, and applies it to study token embeddings and their evolution through layers in the BERT-base and DeepSeek-R1-Distill-Qwen-1.5B models. Code is provided for the numerical results.

Significance. If the conditional relation holds and the applications are robust, the work offers a new intrinsic dimension tool that connects to heavy-tailed spectral regularization frameworks, potentially aiding analysis of representation geometry in transformers. The explicit conditioning on the Pareto ESD and the availability of reproducible code are strengths.

minor comments (3)

The abstract states that the relation to HTSR/SETOL holds for weight matrices with Pareto ESD, but the applications focus on token embeddings; clarify in the introduction or §3 whether the Pareto condition is checked or relevant for the BERT/DeepSeek experiments.
The manuscript mentions proving 'various properties' of the estimator; ensure the main text explicitly lists these properties with references to the relevant theorems or propositions.
Minor notation inconsistencies may exist between the Patnaik-Pearson definition and the TwoNN baseline; verify consistency in the methods section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive assessment of the manuscript, including the recommendation for minor revision. No specific major comments are provided in the report, so there are no individual points to address point-by-point. We are happy to make any minor changes the editor may request based on the overall summary.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper defines the Patnaik-Pearson dimension by combining the TwoNN estimator with concepts from HTSR/SETOL, proves independent properties of the new estimator, and derives the relation between critical tail exponents only conditionally on the explicit assumption that ESD follows a Pareto distribution. No step equates the new dimension to a fitted input or prior result by construction, and no self-citation chain is load-bearing for the central claims. The applications to BERT and DeepSeek are presented as separate empirical uses of the defined machinery.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents exhaustive enumeration; the ledger is therefore provisional. The new dimension is constructed from the TwoNN estimator and the Pareto assumption on spectral density; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5772 in / 1123 out tokens · 18102 ms · 2026-07-03T23:39:29.548119+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 27 canonical work pages · 6 internal anchors

[1] Alessio Ansuini, Alessandro Laio, Jakob H. Macke, Davide Zoccolan, Intrinsic dimension of data representations in deep neural networks. https://arxiv.org/abs/1905.12784
[2] Hari Bercovici, Vittorino Pata, Philippe Biane, Stable Laws and Domains of Attraction in Free Probability Theory. Annals of Mathematics, 1999-05, Vol. 149 (3), p. 1023-1060
[3] Nicolas Boullé, Alex Townsend, A Mathematical Guide to Operator Learning. https://arxiv.org/abs/2312.14688
[4] P. Cizeau, J. P. Bouchaud, Theory of Levy matrices. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 1994 Sep, 50(3):1810-1822
[5] DeepSeek-AI, DeepSeek-R1: Incentivising Reasoning Capability in LLMs via Reinforcement Learning. https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, Minneapolis, 2-7 June 2019, 4171-4186. https://arxiv.org/abs/1810.04805
[7] Riccardo Di Sipio, Jairo Diaz-Rodriguez, Luis Serrano, The Curved Spacetime of Transformer Architectures. https://arxiv.org/abs/2511.03060
[8] Paul Embrechts, Claudia Klüppelberg, Thomas Mikosch, Modelling Extremal Events: for Insurance and Finance. Springer Berlin Heidelberg (2003)
[9] Facco, E., d'Errico, M., Rodriguez, A. et al., Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Sci Rep 7, 12140 (2017). https://doi.org/10.1038/s41598-017-11873-y https://www.nature.com/articles/s41598-017-11873-y
[10] Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod, The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology. https://arxiv.org/abs/2505.20435
[11] Charles Fefferman, Sanjoy Mitter, Hariharan Narayanan, Testing the Manifold Hypothesis. https://arxiv.org/abs/1310.0425
[12] Stephen Fitz, Peter Romero, Jiyan Jonas Schneider, Hidden Holes: Topological Aspects of Language Models. https://arxiv.org/abs/2406.05798
[13] Sergey Foss, Dmitry Korshunov, Stan Zachary, An Introduction to Heavy-Tailed and Subexponential Distributions. Springer (2012)
[14] Yuri Gardinazzi, Giada Panerai, Karthik Viswanathan, Alessio Ansuini, Alberto Cazzaniga, Matteo Biagetti, Persistent Topological Features in Large Language Models. https://arxiv.org/abs/2410.11042
[15] Kaie Kubjas, Jiayi Li, Maximilian Wiesmann, Geometry of Polynomial Neural Networks. https://arxiv.org/abs/2402.00949
[16] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet, A mathematical perspective on Transformers. https://arxiv.org/abs/2312.10794
[17] Alexandros Grosdos, Elina Robeva, Maksym Zubkov, Algebraic geometry of rational neural networks. https://arxiv.org/abs/2509.11088
[18] Frank E. Grubbs, Helen J. Coon, E. S. Pearson, On the Use of Patnaik Type Chi Approximations to the Range in Significance Tests. Biometrika, Vol. 53, No. 1/2 (Jun., 1966), pp. 248-252. https://doi.org/10.2307/2334073 https://www.jstor.org/stable/2334073
[19] German Magai, Anton Ayzenberg, Topology and Geometry of Data Manifold in Deep Learning. https://arxiv.org/abs/2204.08624
[20] Giovanni Luca Marchetti, Vahid Shahverdi, Stefano Mereta, Matthew Trager, Kathlén Kohn, An Invitation to Neuroalgebraic Geometry. https://arxiv.org/abs/2501.18915v1
[21] Charles H. Martin, WeightWatcher. https://github.com/CalculatedContent/WeightWatcher
[22] Charles H. Martin, A Spectral Renormalization-Group View of Learning. https://www.linkedin.com/posts/charlesmartin14_a-spectral-renormalization-group-view-of-ugcPost-7471078735861944321-jXs
[23] Charles H. Martin, Christopher Hinrichs, SETOL: A Semi-Empirical Theory of (Deep) Learning. https://arxiv.org/abs/2507.17912
[24] Charles H. Martin, Michael W. Mahoney, Traditional and Heavy-Tailed Self Regularization in Neural Network Models. https://arxiv.org/abs/1901.08276
[25] Govind Menon, The geometry of the deep linear network. https://arxiv.org/abs/2411.09004
[26] Thomas Mikosch, Olivier Wintenberger, Extreme Value Theory for Time Series Models with Power-Law Tails. Springer, 2024
[27] Miquel Noguer I Alonso, The Complete Mathematics of Transformers: A Rigorous Treatment with Full Derivations, Proofs, and Theoretical Foundations. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6073468
[28] Eng-Jon Ong, Omer Bobrowski, Gesine Reinert, Primoz Skraba, A Universal Nearest-Neighbor Estimator for Intrinsic Dimensionality. https://arxiv.org/abs/2603.10493v1
[29] Patnaik, P. B., The non-central $\chi^2$- and F-distributions and their application. Biometrika, 36(1/2) (1949), 202-232. https://doi.org/10.2307/2332542
[30] Pearson, E. S., Note on an approximation to the distribution of noncentral $\chi^2$. Biometrika, 46(3/4), (1959) 364. https://doi.org/10.2307/2333533
[31] Marc Potters, Jean-Philippe Bouchaud, A first course in Random Matrix Theory: for physicists, engineers and data scientists. Cambridge University Press (2021)
[32] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra, Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. https://arxiv.org/abs/2201.02177
[33] Hari K. Prakash, Charles H. Martin, Grokking and Generalization Collapse: Insights from HTSR theory. https://arxiv.org/abs/2506.04434
[34] Philippe Rigollet, The Mean-Field Dynamics of Transformers. https://arxiv.org/abs/2512.01868
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention Is All You Need. https://arxiv.org/abs/1706.03762
[36] D. Voiculescu, K. Dykema, A. Nica, Free random variables. CRM Monograph Series, No. 1, A.M.S., Providence, RI, 1992
[37] James Vuckovic, Aristide Baratin, Remi Tachet des Combes, A Mathematical Theory of Attention. https://arxiv.org/abs/2007.02876
[38] Nick Whiteley, Annie Gray, Patrick Rubin-Delanchy, Statistical exploration of the Manifold Hypothesis. https://arxiv.org/abs/2208.11665v3

8 Appendix: The TwoNN intrinsic dimension formula. We derive the formulae given in [9] which we use in Section 2.3, in particular (9). Suppose we have a region in $\mathbb{R}^d$, containing random points with (uniform) density $\rho$. Th...
