Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective
Pith reviewed 2026-05-22 10:05 UTC · model grok-4.3
The pith
The Representation Gap in neural networks is asymptotically governed by the task's intrinsic dimension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the Representation Gap and derive its precise asymptotic equivalent, demonstrating that the quantity is governed by a single interpretable parameter, the intrinsic dimension of the task, which connects directly to the equivariances built into common neural network architectures and extends beyond diffusion models to other training dynamics.
What carries the argument
The Representation Gap, a quantity derived from representation learning that admits exact asymptotic analysis through optimal quantization and point-process theory.
If this is right
- Generalization error can be predicted from an efficiently estimated intrinsic dimension without full training runs.
- Architectures whose equivariances match the task's intrinsic dimension should systematically reduce the Representation Gap.
- The asymptotic law applies to a broader set of training algorithms and model families beyond equivariant diffusion models.
- The asymptotic law and the dimension estimator are empirically accurate on both synthetic data with known geometry and realistic datasets.
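As a worked illustration of what the first bullet would mean in practice, take the asymptotic form quoted further down this page, R_n ∼ J_d / n^{2/d}, at face value. The constant J_d is treated as unknown and cancels in the ratio; the numbers are plain arithmetic on that form, not results reported by the paper.

```latex
\[
\frac{R_{2n}}{R_n} \;\longrightarrow\; 2^{-2/d},
\qquad
d = 2:\; 2^{-1} = 0.5,
\qquad
d = 20:\; 2^{-0.1} \approx 0.933 .
\]
\[
\left(2^{-2/d}\right)^{m} = \tfrac{1}{2}
\;\Longleftrightarrow\;
m = \tfrac{d}{2}
\quad\text{(number of doublings of } n \text{ needed to halve the gap).}
\]
```

Under this reading, halving the gap at d = 20 takes about ten doublings of the dataset, roughly a thousandfold increase, whereas at d = 2 a single doubling suffices. This is why a reduction of the effective dimension through architectural symmetries would matter.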
Where Pith is reading between the lines
- Designers could estimate intrinsic dimension on new data to decide whether a given architecture family is likely to succeed before training.
- The connection to equivariances suggests a route to test whether adding or removing symmetries in a model changes the effective dimension in the predicted way.
- If the single-parameter control fails for non-equivariant models, that would mark a clear boundary for when the geometric explanation applies.
Load-bearing premise
Results from optimal quantization and point-process theory can be applied directly to the training dynamics and representation learning of equivariant diffusion models and similar tasks.
What would settle it
Measuring the Representation Gap on a dataset with an independently known intrinsic dimension and finding that the observed values deviate systematically from the predicted asymptotic formula would falsify the central claim.
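A minimal sketch of such a check on synthetic data, under stated assumptions: the paper's exact definition of the Representation Gap is not reproduced on this page, so the code uses a stand-in (the mean squared distance from fresh points to their nearest sample point) and tests only whether its log-log slope in n is close to -2/d. The function names, sample sizes, and the uniform-square geometry are all hypothetical choices made for illustration.

```python
import numpy as np

def mean_sq_nn_distance(train, test):
    """Mean squared distance from each test point to its nearest train point.

    Stand-in for the Representation Gap (an assumption, not the paper's definition).
    """
    d2 = (
        (test ** 2).sum(axis=1)[:, None]
        - 2.0 * test @ train.T
        + (train ** 2).sum(axis=1)[None, :]
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.maximum(d2, 0.0).min(axis=1).mean()

def fitted_dimension(d, sizes, n_test=1000, ambient=10, seed=0):
    """Fit log(gap) against log(n) and invert the slope -2/d."""
    rng = np.random.default_rng(seed)
    # Orthonormal columns embed a d-dimensional unit cube into R^ambient
    # without distorting distances.
    basis, _ = np.linalg.qr(rng.normal(size=(ambient, d)))

    def sample(n):
        return rng.uniform(size=(n, d)) @ basis.T

    test = sample(n_test)
    gaps = [mean_sq_nn_distance(sample(n), test) for n in sizes]
    slope, _ = np.polyfit(np.log(sizes), np.log(gaps), 1)
    return -2.0 / slope, gaps

d_hat, gaps = fitted_dimension(d=2, sizes=[250, 500, 1000, 2000, 4000])
print(f"known dimension: 2, dimension recovered from the slope: {d_hat:.2f}")
```

A recovered value that stays far from the known dimension as n grows would be the kind of systematic deviation described above; boundary effects and finite-sample noise mean small discrepancies are expected.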
Original abstract
Characterizing precisely the asymptotic generalization error of neural networks using parameters that can be estimated efficiently is a crucial problem in machine learning, which relies heavily on heuristics and practitioners' intuition to make key design choices. In order to mitigate this issue, we introduce the Representation Gap, a metric closely related to the generalization error, but admitting better-behaved asymptotic dynamics. Focusing on equivariant diffusion models and leveraging results from optimal quantization and point-process theory, we derive a precise asymptotic equivalent of the Representation Gap and show that it is governed by a single parameter, the intrinsic dimension of the task, which is easy to interpret, efficient to estimate, and can be linked to the equivariances of common neural network architectures. We show that this asymptotic dynamic also extends to a broader range of tasks and training algorithms. Finally, we demonstrate empirically that our asymptotic law and intrinsic dimension estimation are accurate on a wide range of synthetic datasets, where these quantities are known, as well as on more realistic datasets, where we obtain results consistent with the related literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Representation Gap, a metric related to generalization error that admits better asymptotic behavior. Focusing on equivariant diffusion models, it imports results from optimal quantization and point-process theory to derive a precise asymptotic equivalent governed by a single parameter—the intrinsic dimension of the task—which is claimed to be interpretable, efficiently estimable, and linked to common architectural equivariances. The asymptotic law is asserted to extend to broader tasks and training algorithms, with empirical support on synthetic datasets (where ground truth is known) and realistic datasets (where results align with prior literature).
Significance. If the central derivation is rigorous and the transfer from quantization limits to learned representations is justified without unverified analogies, the result would supply a geometric, single-parameter account of neural network effectiveness that is directly tied to task geometry and architecture choices. This could be valuable for guiding design and for providing falsifiable predictions, especially given the emphasis on efficient estimation of the governing parameter.
major comments (2)
- [Theoretical derivation (main body, around the statement of the asymptotic equivalent)] The central derivation imports asymptotic results from optimal quantization and point-process theory to characterize the Representation Gap under equivariant diffusion training. However, the manuscript does not appear to supply an explicit coupling or invariance argument establishing that the empirical measure of features produced by SGD on the diffusion loss converges to the same intensity measures and Palm distributions as the quantization optimum. This step is load-bearing for the reduction to a single intrinsic-dimension parameter and for the claimed link to equivariances.
- [Section on broader applicability] The extension of the asymptotic law to 'a broader range of tasks and training algorithms' is stated in the abstract and presumably justified later, but the conditions under which the point-process statistics remain valid outside the equivariant diffusion setting are not clearly delimited. Without these conditions, the generality claim risks overreach relative to the derivation's scope.
minor comments (2)
- [Introduction / Notation] Notation for the Representation Gap and its asymptotic equivalent should be introduced with a clear definition before the main theorem, including any dependence on the empirical measure.
- [Experiments] The empirical validation section would benefit from explicit reporting of how the intrinsic dimension is estimated in practice (e.g., which estimator is used and its sensitivity) on both synthetic and realistic datasets.
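To make the Experiments comment concrete, here is a sketch of one common nearest-neighbor estimator of intrinsic dimension: a Levina-Bickel style maximum-likelihood estimate, with the inverse local estimates averaged across points. This is an assumption chosen for illustration. The page does not say which estimator the paper uses, and the function name, the choice k = 10, and the sphere test data are hypothetical.

```python
import numpy as np

def mle_intrinsic_dimension(points, k=10):
    """Nearest-neighbor maximum-likelihood estimate of intrinsic dimension.

    For each point, T_1 <= ... <= T_k are the distances to its k nearest
    neighbors; the local inverse estimate is the mean of log(T_k / T_j)
    over j < k. Inverses are averaged over points before inverting.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # exclude each point itself
    knn = np.sort(dist, axis=1)[:, :k]        # T_1 ... T_k per point
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])
    inv_local = log_ratios.mean(axis=1)       # (1/(k-1)) * sum_j log(T_k/T_j)
    return 1.0 / inv_local.mean()

# Synthetic check: points on the unit sphere in R^3 have intrinsic dimension 2.
rng = np.random.default_rng(0)
sphere = rng.normal(size=(800, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
d_hat = mle_intrinsic_dimension(sphere, k=10)
print(f"ambient dimension: 3, estimated intrinsic dimension: {d_hat:.2f}")
```

The sensitivity the referee asks about can be probed by rerunning the estimate over a range of k and sample sizes and reporting how much the value moves.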
Simulated Author's Rebuttal
We are grateful to the referee for the careful reading and valuable suggestions. Below we provide detailed responses to the major comments and indicate the revisions we plan to make to address them.
-
Referee: [Theoretical derivation (main body, around the statement of the asymptotic equivalent)] The central derivation imports asymptotic results from optimal quantization and point-process theory to characterize the Representation Gap under equivariant diffusion training. However, the manuscript does not appear to supply an explicit coupling or invariance argument establishing that the empirical measure of features produced by SGD on the diffusion loss converges to the same intensity measures and Palm distributions as the quantization optimum. This step is load-bearing for the reduction to a single intrinsic-dimension parameter and for the claimed link to equivariances.
Authors: We thank the referee for pointing out this important aspect of the theoretical derivation. The manuscript does rely on the asymptotic equivalence from quantization theory, assuming that the learned representations under equivariant diffusion training achieve similar point-process statistics. However, we recognize that an explicit argument establishing the convergence of the SGD empirical measure to the optimal quantization intensity measures via invariance or coupling was not provided in detail. This is a valid concern as it underpins the single-parameter dependence on intrinsic dimension. In the revised version, we will include an additional lemma or subsection that outlines the coupling argument, leveraging the equivariance properties and the structure of the diffusion objective to justify why the Palm distributions align.
Revision: yes
-
Referee: [Section on broader applicability] The extension of the asymptotic law to 'a broader range of tasks and training algorithms' is stated in the abstract and presumably justified later, but the conditions under which the point-process statistics remain valid outside the equivariant diffusion setting are not clearly delimited. Without these conditions, the generality claim risks overreach relative to the derivation's scope.
Authors: We appreciate the referee's caution regarding the scope of the generality claim. The extension is motivated by the geometric interpretation of the Representation Gap, which suggests that similar asymptotic behavior may hold whenever the task has a well-defined intrinsic dimension and the training induces representations that can be analyzed via point processes. Nevertheless, we agree that without explicit conditions, the claim could be seen as overreaching. In the revised manuscript, we will rewrite the relevant section to clearly delimit the conditions (e.g., requiring approximate equivariance or geometric regularity in the data manifold) under which the point-process statistics are expected to hold, and we will adjust the abstract and conclusions to reflect this more precise scope.
Revision: yes
Circularity Check
No circularity: asymptotic derivation applies pre-existing quantization results independently to NN representations
The paper introduces the Representation Gap metric and derives its asymptotic equivalent by directly leveraging established results from optimal quantization and point-process theory, which are independent of the neural network training context. The single-parameter reduction to intrinsic dimension follows from those external limits rather than from fitting or redefining the target quantity within the paper. Empirical checks on synthetic datasets (where ground truth is known) and realistic data provide external falsifiability. No self-citation chain, self-definitional loop, or fitted-input-as-prediction pattern appears in the derivation; the equivariance link is a downstream consequence, not a load-bearing assumption that collapses the claim. The overall chain remains self-contained against external mathematical benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Optimal quantization and point-process theory results apply to the asymptotic dynamics of equivariant diffusion models and neural network representations.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: echoes
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: the representation gap reduces to a quantization problem on the quotient manifold Ω/G … R_n ∼ J_d / n^{2/d} where d = d_{Ω/G}
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_add / orbit structure under generator · tag: echoes
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: equivariant model virtually augments the dataset … Ω_f = G(D)
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: echoes
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: asymptotic equivalent … governed by a single parameter, the intrinsic dimension
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Volume of small balls and sub-Riemannian curvature in 3D contact manifolds
Davide Barilari, Ivan Beschastnyi, and Antonio Lerario. Volume of small balls and sub-Riemannian curvature in 3D contact manifolds. arXiv preprint arXiv:1802.10155,
-
[2]
ISSN 1939-3539. doi: 10.1109/TPAMI.2013.50. URL https://ieeexplore.ieee.org/document/6472238/. Gérard Biau and Luc Devroye. Lectures on the nearest neighbor method, volume
-
[3]
doi: 10.1051/epjconf/202225809001
ISSN 2100-014X. doi: 10.1051/epjconf/202225809001. Shuxiao Chen, Edgar Dobriban, and Jane H Lee. A group-theoretic framework for data augmentation. Journal of Machine Learning Research,
-
[4]
ISSN 0894-0347, 1088-6834. doi: 10.1090/jams/852. URL https://www.ams.org/jams/2016-29-04/S0894-0347-2016-00852-4/. Emma Finn, T. Anderson Keller, Manos Theodosis, and Demba E. Ba. Origins of creativity in attention-based diffusion models. In HiLD at ICML
-
[5]
doi: 10.48550/arXiv.2506.17324. URL http://arxiv.org/abs/2506.17324. S. Gallot, D. Hulin, and J. Lafontaine. Riemannian geometry. Universitext. Springer-Verlag, Berlin; New York, 2nd edition,
-
[6]
ISBN 978-1-7281-3293-8. doi: 10.1109/CVPR.2019.00411. URL https://ieeexplore.ieee.org/document/8953348/. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, June
-
[7]
Approximation capabilities of multilayer feedforward networks
ISSN 0893-6080. doi: 10.1016/0893-6080(91)90009-T. URL https://linkinghub.elsevier.com/retrieve/pii/089360809190009T. Mikaela Iacobelli. Asymptotic quantization for probability measures on Riemannian manifolds. ESAIM: Control, Optimisation and Calculus of Variations, 22(3):770–785,
-
[8]
Staying on the manifold: Geometry-aware noise injection. arXiv preprint arXiv:2509.20201,
Albert Kjøller Jacobsen, Johanna Marie Gegenfurtner, and Georgios Arvanitidis. Staying on the manifold: Geometry-aware noise injection. arXiv preprint arXiv:2509.20201,
-
[9]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020a. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario...
-
[10]
Gradient-based learning applied to document recognition,
ISSN 1558-2256. doi: 10.1109/5.726791. URL https://ieeexplore.ieee.org/document/726791/. John M. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer Science & Business Media, April
-
[11]
Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024
ISBN 978-0-387-22726-9. Google-Books-ID: 92PgBwAAQBAJ. Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, and Stefano Soatto. On the scalability of diffusion-based text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9400–940...
-
[12]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision–ECCV 2014, pages 740–755, 2014a. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr...
-
[13]
ISSN 1099-4300. doi: 10.3390/e23111403. URL https://www.mdpi.com/1099-4300/23/11/1403. Publisher: Multidisciplinary Digital Publishing Institute. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press,
-
[14]
There Will Be a Scientific Theory of Deep Learning
Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh, Florentin Guth, Arthur Jacot, Mason Kamb, et al. There will be a scientific theory of deep learning. arXiv preprint arXiv:2604.21691,
-
[15]
Emogen: Emotional image content generation with text-to-image diffusion models,
doi: 10.1109/CVPR52733.2024.00458. URL https://ieeexplore.ieee.org/document/10658325/. Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844,
-
[16]
Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE,
-
[17]
PoseNet: A convolutional network for real-time 6-dof camera relocalization,
IEEE. ISBN 978-1-7281-4803-8. doi: 10.1109/ICCV.2019.00464. URL https://ieeexplore.ieee.org/document/9010395/. Jiachen Yao, Mayank Goswami, and Chao Chen. A theoretical study of neural network expressive power via manifold topology, October
-
[18]
ISSN 0001-0782, 1557-7317. doi: 10.1145/3446776.
A Notations
A.1 Task and geometry
Manifold. We consider a supervised task, with input space X ⊂ R^{d_X} and target space Y ⊂ R^{d_Y}. We assume that the observations (x, y) belong to a subset Ω ⊂ X × Y, which models the structure of the task and its underlying symmetries. Following the manifold hypothesis (Beng...
-
[19]
We denote by 1[E] the indicator function of a set E
the evaluation of its density at a point y. We denote by 1[E] the indicator function of a set E. For a finite set E, we denote by |E| its cardinality. If E is measurable, |E| denotes its Lebesgue measure, and E̊ its interior. Asymptotic notation. We denote by a_n ∼ b_n the deterministic asymptotic equivalence a_n / b_n → 1. Similarly, we write X_n ∼_P a_n when X_n / a_n → 1 ...
-
[20]
(see Section E.2). Given a budget of n points, we are often interested in the best approximation achievable by a discrete set of size n. This quantity, known as the optimal quantization error or optimal quantization risk (Graf and Luschgy, 2007), is defined as
R_n(P) = inf_{z ∈ Y^n} ∫_Y min_{k ∈ ⟦1,n⟧} ℓ(y, z_k) p(y) dy. (18)
A central tool of our analysis is Zado...
-
[21]
under a common geometric framework. Despite these similarities, Theorem 8 and Zador’s theorem describe fundamentally different types of asymptotic results. Theorem 8 characterizes the quantization error averaged over all i.i.d. datasets of size n, whereas Zador’s theorem characterizes the quantization error of a specific (optimal) point configuration. From ...
-
[22]
or normalizing flows (Rezende and Mohamed, 2016). Among them, diffusion models can be shown to converge toward the empirical distribution (1/|D|) Σ_{y∈D} δ_y when they minimize their training objective (Song and Ermon, 2019). We will focus on this class of models hereafter. In this case, the empirical distribution corresponds to the prediction space Ω_f learned...
-
[23]
Most notably, the leading constant depends on the geometry of Ω only via a volume term V*_d(p)
This result is remarkable, since it provides an asymptotic equivalent of the representation gap as the dataset size n grows to infinity. Most notably, the leading constant depends on the geometry of Ω only via a volume term V*_d(p).
C.3 Asymptotic representation gap under the manifold hypothesis
It is possible to extend this result when Ω is a low-dimensi...
-
[24]
Informally, the proof of Lemma 1 then relies on the two following approximations:
I(t) = ∫_{G(D)} h(z) e^{−∥z − y/α_t∥² / β_t} dz ≈ ∫_{G(D)} h(z) e^{−∥z − y∥² / β_t} dz ≈ h(y*) e^{−∥y* − y∥² / β_t} (2πβ_t)^{d/2}.
The first approximation comes from integrating ∥z − y/α_t∥² = ∥z − y∥² + O(β_t) over the orbit G(D), and the second approximation is an extension of Laplace approximation on measurable s...
-
[25]
By Theorem B.3 in Kamb and Ganguli (2025), the score function by the model f can be written
f(y_t, t) = −(1/(1−α_t)) · [∫_{G(D)} (y − √α_t z) N(y | √α_t z, (1−α_t) I) dz] / [∫_{G(D)} N(y | √α_t z, (1−α_t) I) dz] = (1/(1−α_t)) (y_t − y*_t) + o(1/(1−α_t)), (32)
where the second equality is a corollary of Lemma 1 to be justified later. Then, hypothesis (iv) implies that γ_t f(y_t, t) = ∂_t y_t + γ_t y_t is boun...
-
[26]
Therefore, we can identify its prediction space Ω_f with G(D)
Proposition 4 establishes that an equivariant diffusion model f generates samples in G(D). Therefore, we can identify its prediction space Ω_f with G(D). If the symmetry group G enforced by the architecture is aligned with the symmetries of the manifold Ω, then the effective dimension of the learning problem is reduced from d_Ω to d_{Ω/G}.
Proposition 5 (Repr...
-
[27]
E Link with related work
In this section, we clarify the relations of the concept introduced in this article with several related works.
E.1 Generalization error
A natural question is to relate the representation gap R(Ω, Ω_f) to the generalization error (Shalev-Shwartz and Ben-David, 2014), commonly used to characterize generalization. We focus on the...
-
[28]
E.2 Wasserstein distance
Generalization error and representation gap are therefore closely related.
E.2 Wasserstein distance
The Wasserstein distance (Peyré et al., 2019) is typically used to measure neural network generalization (Theis et al., 2015). Interestingly, we can see that the representation gap R(Ω, Ω_f) is a particular case of the Wasserstein distance W(Ω, Ω_f), where each poi...