Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective

David Perera; Flavio Figueiredo; Lais Isabelle Alves dos Santos; Michel F. C. Haddad; Victor Moura

arxiv: 2605.21692 · v1 · pith:6ZBZ5ENKnew · submitted 2026-05-20 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-22 10:05 UTC · model grok-4.3

keywords representation gap · intrinsic dimension · generalization error · equivariant diffusion models · asymptotic analysis · optimal quantization · point process theory

The pith

The Representation Gap in neural networks is asymptotically governed by the task's intrinsic dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Representation Gap as a metric tied to generalization error but with cleaner asymptotic behavior. Drawing on optimal quantization and point-process theory for equivariant diffusion models, the authors derive an exact asymptotic form controlled by one parameter: the intrinsic dimension of the task. This dimension is straightforward to interpret, cheap to estimate, and aligns with the symmetry properties that standard neural architectures exploit. The same asymptotic law is shown to hold for a wider collection of tasks and training methods, with empirical checks on both synthetic data (where dimensions are known) and realistic datasets.

Core claim

We introduce the Representation Gap and derive its precise asymptotic equivalent, demonstrating that the quantity is governed by a single interpretable parameter, the intrinsic dimension of the task, which connects directly to the equivariances built into common neural network architectures and extends beyond diffusion models to other training dynamics.

What carries the argument

The Representation Gap, a quantity derived from representation learning that admits exact asymptotic analysis through optimal quantization and point-process theory.

If this is right

Generalization error can be predicted from an efficiently estimated intrinsic dimension without full training runs.
Architectures whose equivariances match the task's intrinsic dimension should systematically reduce the Representation Gap.
The asymptotic law applies to a broader set of training algorithms and model families beyond equivariant diffusion models.
Empirical accuracy of the law and the dimension estimator holds on both synthetic data with known geometry and on realistic datasets.
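
As a back-of-the-envelope illustration of the first point, the scaling R_n ∼ J_d / n^{2/d} quoted further down this page implies that the ratio of the gap at two dataset sizes does not depend on the constant J_d. The sketch below is not from the paper: the function names are hypothetical, and the numbers are meaningful only in the asymptotic regime where the law applies.

```python
def gap_ratio(n_old: float, n_new: float, d: float) -> float:
    """Predicted R_{n_new} / R_{n_old} under R_n ~ J_d / n**(2/d); J_d cancels."""
    if n_old <= 0 or n_new <= 0 or d <= 0:
        raise ValueError("dataset sizes and intrinsic dimension must be positive")
    return (n_old / n_new) ** (2.0 / d)


def data_multiplier(reduction: float, d: float) -> float:
    """Factor by which n must grow to divide the gap by `reduction`."""
    if reduction <= 0 or d <= 0:
        raise ValueError("reduction factor and intrinsic dimension must be positive")
    return reduction ** (d / 2.0)


if __name__ == "__main__":
    # Hypothetical intrinsic dimensions, chosen only for illustration.
    for d in (1, 2, 10):
        print(d, gap_ratio(1_000, 2_000, d), data_multiplier(2.0, d))
```

Under these assumptions, doubling the data halves the predicted gap when d = 2, while halving the gap when d = 10 takes 32 times more data.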

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers could estimate intrinsic dimension on new data to decide whether a given architecture family is likely to succeed before training.
The connection to equivariances suggests a route to test whether adding or removing symmetries in a model changes the effective dimension in the predicted way.
If the single-parameter control fails for non-equivariant models, that would mark a clear boundary for when the geometric explanation applies.
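
The symmetry test suggested above can be pictured on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses the mean squared nearest-neighbour distance as a stand-in for the Representation Gap, takes the 2D sphere surface as the task, and assumes the symmetry group G is the set of rotations about the z-axis (this page does not say which rotation group the paper uses). It compares the decay of the gap for a plain dataset D with the decay for its orbit G(D).

```python
import numpy as np


def sample_sphere(n, rng):
    """Draw n points uniformly on the unit sphere in R^3."""
    x = rng.standard_normal((n, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def gap_to_points(data, queries):
    """Mean squared distance from each query to its nearest data point."""
    diff = queries[:, None, :] - data[None, :, :]
    return float((diff ** 2).sum(axis=-1).min(axis=1).mean())


def gap_to_orbit(data, queries):
    """Mean squared distance from each query to the orbit G(D), G = rotations about z.

    Rotating a data point about z sweeps a circle of latitude. The closest point
    on that circle shares the query's longitude, so the squared chord length is
    2 - 2*cos(difference of polar angles).
    """
    theta_d = np.arccos(np.clip(data[:, 2], -1.0, 1.0))
    theta_q = np.arccos(np.clip(queries[:, 2], -1.0, 1.0))
    d2 = 2.0 - 2.0 * np.cos(theta_q[:, None] - theta_d[None, :])
    return float(d2.min(axis=1).mean())


def loglog_slope(gap_fn, sizes=(100, 200, 400, 800, 1600),
                 trials=20, n_queries=500, seed=0):
    """Least-squares slope of log(mean gap) against log(n)."""
    rng = np.random.default_rng(seed)
    log_gap = []
    for n in sizes:
        vals = [gap_fn(sample_sphere(n, rng), sample_sphere(n_queries, rng))
                for _ in range(trials)]
        log_gap.append(np.log(np.mean(vals)))
    slope, _intercept = np.polyfit(np.log(sizes), log_gap, 1)
    return float(slope)


if __name__ == "__main__":
    print("plain dataset D :", loglog_slope(gap_to_points))
    print("orbit G(D)      :", loglog_slope(gap_to_orbit))
```

In this toy setting the first slope should sit near −1 and the second should be markedly steeper, in the neighbourhood of −2, which is the qualitative change in effective dimension that the suggested test would look for.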

Load-bearing premise

Results from optimal quantization and point-process theory can be applied directly to the training dynamics and representation learning of equivariant diffusion models and similar tasks.

What would settle it

Measuring the Representation Gap on a dataset with an independently known intrinsic dimension and finding that the observed values deviate systematically from the predicted asymptotic formula would falsify the central claim.
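
A minimal sketch of such a check on synthetic data, under explicit assumptions: the paper's exact definition of the Representation Gap is not reproduced on this page, so the mean squared nearest-neighbour distance between fresh samples and an i.i.d. dataset is used here as a proxy, and unit spheres serve as tasks with known intrinsic dimension. The helper names are hypothetical. The fitted log-log slope s is converted into an implied dimension −2/s, which can then be compared with the known value.

```python
import numpy as np


def sample_unit_sphere(n, ambient_dim, rng):
    """Uniform samples on the unit sphere in R^ambient_dim.

    The intrinsic dimension of this set is ambient_dim - 1.
    """
    x = rng.standard_normal((n, ambient_dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def nn_gap(data, queries):
    """Proxy gap: mean squared distance from each query to its nearest data point."""
    diff = queries[:, None, :] - data[None, :, :]
    return float((diff ** 2).sum(axis=-1).min(axis=1).mean())


def implied_dimension(ambient_dim, sizes=(200, 400, 800, 1600),
                      trials=20, n_queries=500, seed=0):
    """Fit log(gap) = a + s*log(n) and return -2/s, as R_n ~ J_d / n^(2/d) suggests."""
    rng = np.random.default_rng(seed)
    log_gap = []
    for n in sizes:
        vals = [nn_gap(sample_unit_sphere(n, ambient_dim, rng),
                       sample_unit_sphere(n_queries, ambient_dim, rng))
                for _ in range(trials)]
        log_gap.append(np.log(np.mean(vals)))
    slope, _intercept = np.polyfit(np.log(sizes), log_gap, 1)
    return float(-2.0 / slope)


if __name__ == "__main__":
    print("circle in R^2, known d = 1:", implied_dimension(2))
    print("sphere in R^3, known d = 2:", implied_dimension(3))
```

A systematic mismatch between the implied and the known dimension, persisting as n grows, is the kind of deviation the paragraph above describes.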

Figures

Figures reproduced from arXiv: 2605.21692 by David Perera, Flavio Figueiredo, Lais Isabelle Alves dos Santos, Michel F. C. Haddad, Victor Moura.

**Figure 2.** Log plot of the asymptotic evolution of the representation gap of a rotation-equivariant model and a non-equivariant model for a 2D sphere surface. The x-axis corresponds to the dataset size n, and the y-axis corresponds to the representation gap. We observe a linear evolution, with slope −1 for the non-equivariant model and −2 for the equivariant model. The theoretical curves are shown using an empirical…

**Figure 3.** Virtual augmentation of a dataset in the non-conditional setting (Hypercube dataset) and …

**Figure 4.** Asymptotic behavior of the representation gap across the two datasets of …
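
The slopes reported for Figure 2 can be read against the scaling R_n ∼ J_d / n^{2/d} quoted in the Lean section of this page. A short worked step, assuming the law holds with J_d independent of n:

```latex
\log R_n \;\approx\; \log J_d \;-\; \frac{2}{d}\,\log n
\quad\Longrightarrow\quad
\text{slope} = -\frac{2}{d},
\qquad
d = 2 \;\Rightarrow\; \text{slope} = -1,
\qquad
d = 1 \;\Rightarrow\; \text{slope} = -2 .
```

Under this reading, the non-equivariant slope of −1 matches the intrinsic dimension 2 of the sphere surface, and the equivariant slope of −2 corresponds to an effective dimension of 1.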

read the original abstract

Characterizing precisely the asymptotic generalization error of neural networks using parameters that can be estimated efficiently is a crucial problem in machine learning, which relies heavily on heuristics and practitioners' intuition to make key design choices. In order to mitigate this issue, we introduce the Representation Gap, a metric closely related to the generalization error, but admitting better-behaved asymptotic dynamics. Focusing on equivariant diffusion models and leveraging results from optimal quantization and point-process theory, we derive a precise asymptotic equivalent of the Representation Gap and show that it is governed by a single parameter, the intrinsic dimension of the task, which is easy to interpret, efficient to estimate, and can be linked to the equivariances of common neural network architectures. We show that this asymptotic dynamic also extends to a broader range of tasks and training algorithms. Finally, we demonstrate empirically that our asymptotic law and intrinsic dimension estimation are accurate on a wide range of synthetic datasets, where these quantities are known, as well as on more realistic datasets, where we obtain results consistent with the related literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a Representation Gap metric and reduces its asymptotics to intrinsic dimension using quantization theory, with decent empirical checks, but the key transfer from point-process limits to actual training dynamics is assumed rather than derived.

The main thing here is that the authors introduce the Representation Gap as a proxy for generalization error in equivariant diffusion models and claim it has a clean asymptotic form controlled by a single parameter: the intrinsic dimension of the task. They pull in tools from optimal quantization and point-process theory to get the reduction and then tie it to architectural equivariances. They also say the same law extends beyond diffusion models.

On the positive side, the empirical section looks useful. They test on synthetic data where the true intrinsic dimension is known and recover it accurately, then move to realistic datasets and get numbers that line up with existing literature on effective dimensions. That kind of check is worth having if you're trying to estimate how much data or capacity a model really needs.

The soft spot is the theoretical step that carries the main claim. The derivation imports the asymptotic error formulas from quantization but does not show why the representations produced by SGD on the diffusion loss actually converge to the same limiting point process, including the right intensity measures. Equivariance is mentioned as a bridge to architecture, yet there's no explicit invariance or coupling argument that preserves the necessary ergodicity under the training dynamics. Without that, the single-parameter result stays conditional on an unverified analogy.

This paper is for readers who work on geometric or measure-theoretic accounts of generalization and who care about intrinsic dimension as a practical knob. Someone doing architecture search or data-efficiency studies could extract ideas from the empirical side.

It deserves peer review. The experiments give something concrete to evaluate, and the core idea is clear enough that referees can focus on tightening the derivation rather than starting from scratch.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Representation Gap, a metric related to generalization error that admits better asymptotic behavior. Focusing on equivariant diffusion models, it imports results from optimal quantization and point-process theory to derive a precise asymptotic equivalent governed by a single parameter—the intrinsic dimension of the task—which is claimed to be interpretable, efficiently estimable, and linked to common architectural equivariances. The asymptotic law is asserted to extend to broader tasks and training algorithms, with empirical support on synthetic datasets (where ground truth is known) and realistic datasets (where results align with prior literature).

Significance. If the central derivation is rigorous and the transfer from quantization limits to learned representations is justified without unverified analogies, the result would supply a geometric, single-parameter account of neural network effectiveness that is directly tied to task geometry and architecture choices. This could be valuable for guiding design and for providing falsifiable predictions, especially given the emphasis on efficient estimation of the governing parameter.

major comments (2)

[Theoretical derivation (main body, around the statement of the asymptotic equivalent)] The central derivation imports asymptotic results from optimal quantization and point-process theory to characterize the Representation Gap under equivariant diffusion training. However, the manuscript does not appear to supply an explicit coupling or invariance argument establishing that the empirical measure of features produced by SGD on the diffusion loss converges to the same intensity measures and Palm distributions as the quantization optimum. This step is load-bearing for the reduction to a single intrinsic-dimension parameter and for the claimed link to equivariances.
[Section on broader applicability] The extension of the asymptotic law to 'a broader range of tasks and training algorithms' is stated in the abstract and presumably justified later, but the conditions under which the point-process statistics remain valid outside the equivariant diffusion setting are not clearly delimited. Without these conditions, the generality claim risks overreach relative to the derivation's scope.

minor comments (2)

[Introduction / Notation] Notation for the Representation Gap and its asymptotic equivalent should be introduced with a clear definition before the main theorem, including any dependence on the empirical measure.
[Experiments] The empirical validation section would benefit from explicit reporting of how the intrinsic dimension is estimated in practice (e.g., which estimator is used and its sensitivity) on both synthetic and realistic datasets.
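
For readers unfamiliar with what such an estimator looks like, here is a minimal sketch of one standard choice, the maximum-likelihood form of the two-nearest-neighbour ratio estimator (commonly called TwoNN). This is a swapped-in technique for illustration only: this page does not say which estimator the paper uses, and the function name is hypothetical. The estimator relies on the fact that, for locally uniform data of intrinsic dimension d, the logarithm of the ratio r2/r1 of second to first nearest-neighbour distances is approximately exponentially distributed with rate d.

```python
import numpy as np


def twonn_dimension(points):
    """Maximum-likelihood TwoNN estimate: N / sum(log(r2 / r1))."""
    x = np.asarray(points, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)          # pairwise squared distances
    np.fill_diagonal(d2, np.inf)           # exclude self-distances
    # After partitioning at index 1, columns 0 and 1 hold the two smallest values.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1_sq, r2_sq = nearest[:, 0], nearest[:, 1]
    if np.any(r1_sq == 0.0):
        raise ValueError("duplicate points make the distance ratio undefined")
    log_mu = 0.5 * np.log(r2_sq / r1_sq)   # log(r2 / r1) from squared distances
    return float(len(x) / log_mu.sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.standard_normal((1000, 3))
    sphere = g / np.linalg.norm(g, axis=1, keepdims=True)
    print("estimate on a 2D sphere surface:", twonn_dimension(sphere))
```

The dense pairwise-distance matrix keeps the sketch short but scales quadratically in memory, so a sensitivity study of the kind the referee asks for would normally swap in a nearest-neighbour index and repeat the estimate over subsamples.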

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the careful reading and valuable suggestions. Below we provide detailed responses to the major comments and indicate the revisions we plan to make to address them.

Referee: [Theoretical derivation (main body, around the statement of the asymptotic equivalent)] The central derivation imports asymptotic results from optimal quantization and point-process theory to characterize the Representation Gap under equivariant diffusion training. However, the manuscript does not appear to supply an explicit coupling or invariance argument establishing that the empirical measure of features produced by SGD on the diffusion loss converges to the same intensity measures and Palm distributions as the quantization optimum. This step is load-bearing for the reduction to a single intrinsic-dimension parameter and for the claimed link to equivariances.

Authors: We thank the referee for pointing out this important aspect of the theoretical derivation. The manuscript does rely on the asymptotic equivalence from quantization theory, assuming that the learned representations under equivariant diffusion training achieve similar point process statistics. However, we recognize that an explicit argument establishing the convergence of the SGD empirical measure to the optimal quantization intensity measures via invariance or coupling was not provided in detail. This is a valid concern as it underpins the single-parameter dependence on intrinsic dimension. In the revised version, we will include an additional lemma or subsection that outlines the coupling argument, leveraging the equivariance properties and the structure of the diffusion objective to justify why the Palm distributions align.
revision: yes

Referee: [Section on broader applicability] The extension of the asymptotic law to 'a broader range of tasks and training algorithms' is stated in the abstract and presumably justified later, but the conditions under which the point-process statistics remain valid outside the equivariant diffusion setting are not clearly delimited. Without these conditions, the generality claim risks overreach relative to the derivation's scope.

Authors: We appreciate the referee's caution regarding the scope of the generality claim. The extension is motivated by the geometric interpretation of the Representation Gap, which suggests that similar asymptotic behavior may hold whenever the task has a well-defined intrinsic dimension and the training induces representations that can be analyzed via point processes. Nevertheless, we agree that without explicit conditions, the claim could be seen as overreaching. In the revised manuscript, we will revise the relevant section to clearly delimit the conditions (e.g., requiring approximate equivariance or geometric regularity in the data manifold) under which the point-process statistics are expected to hold, and we will adjust the abstract and conclusions to reflect this more precise scope.
revision: yes

Circularity Check

0 steps flagged

No circularity: asymptotic derivation applies pre-existing quantization results independently to NN representations

The paper introduces the Representation Gap metric and derives its asymptotic equivalent by directly leveraging established results from optimal quantization and point-process theory, which are independent of the neural network training context. The single-parameter reduction to intrinsic dimension follows from those external limits rather than from fitting or redefining the target quantity within the paper. Empirical checks on synthetic datasets (where ground truth is known) and realistic data provide external falsifiability. No self-citation chain, self-definitional loop, or fitted-input-as-prediction pattern appears in the derivation; the equivariance link is a downstream consequence, not a load-bearing assumption that collapses the claim. The overall chain remains self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The derivation rests on the transfer of optimal quantization and point-process results to neural-network training dynamics; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Optimal quantization and point-process theory results apply to the asymptotic dynamics of equivariant diffusion models and neural network representations.
Explicitly invoked to derive the asymptotic equivalent of the Representation Gap.

pith-pipeline@v0.9.0 · 5734 in / 1173 out tokens · 26299 ms · 2026-05-22T10:05:07.370154+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

the representation gap reduces to a quantization problem on the quotient manifold $\Omega/G$ … $R_n \sim J_d / n^{2/d}$ where $d = d_{\Omega/G}$
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_add / orbit structure under generator echoes

equivariant model virtually augments the dataset … $\Omega_f = G(D)$
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

asymptotic equivalent … governed by a single parameter, the intrinsic dimension

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

[1]

Volume of small balls and sub-Riemannian curvature in 3D contact manifolds

Davide Barilari, Ivan Beschastnyi, and Antonio Lerario. Volume of small balls and sub-riemannian curvature in 3d contact manifolds. arXiv preprint arXiv:1802.10155,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

doi: 10.1109/TPAMI.2013.50

ISSN 1939-3539. doi: 10.1109/TPAMI.2013.50. URL https://ieeexplore.ieee.org/document/6472238/. Gérard Biau and Luc Devroye. Lectures on the nearest neighbor method, volume

work page doi:10.1109/tpami.2013.50 1939
[3]

doi: 10.1051/epjconf/202225809001

ISSN 2100-014X. doi: 10.1051/epjconf/202225809001. Shuxiao Chen, Edgar Dobriban, and Jane H Lee. A group-theoretic framework for data augmentation. Journal of Machine Learning Research,

work page doi:10.1051/epjconf/202225809001
[4]

doi: 10.1090/jams/852

ISSN 0894-0347, 1088-6834. doi: 10.1090/jams/852. URL https://www.ams.org/jams/2016-29-04/S0894-0347-2016-00852-4/. Emma Finn, T. Anderson Keller, Manos Theodosis, and Demba E. Ba. Origins of creativity in attention-based diffusion models. In HiLD at ICML

work page doi:10.1090/jams/852 2016
[5]

doi: 10.48550/arXiv.2506.17324. URL http://arxiv.org/abs/2506.17324. S. Gallot, D. Hulin, and J. Lafontaine. Riemannian geometry. Universitext. Springer-Verlag, Berlin; New York, 2nd edition,

work page doi:10.48550/arxiv.2506
[6]

Generating 3d adversarial point clouds, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9128–9136

ISBN 978-1-7281-3293-8. doi: 10.1109/CVPR.2019.00411. URL https://ieeexplore.ieee.org/document/8953348/. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, June

work page doi:10.1109/cvpr.2019 2019
[7]

Approximation capabilities of multilayer feedforward networks

ISSN 0893-6080. doi: 10.1016/0893-6080(91)90009-T. URL https://linkinghub.elsevier.com/retrieve/pii/089360809190009T. Mikaela Iacobelli. Asymptotic quantization for probability measures on riemannian manifolds. ESAIM: Control, Optimisation and Calculus of Variations, 22(3):770–785,

work page doi:10.1016/0893-6080(91)90009-t
[8]

Staying on the manifold: Geometry-aware noise injection. arXiv preprint arXiv:2509.20201,

Albert Kjøller Jacobsen, Johanna Marie Gegenfurtner, and Georgios Arvanitidis. Staying on the manifold: Geometry-aware noise injection. arXiv preprint arXiv:2509.20201,

work page arXiv
[9]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020a. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario...

work page internal anchor Pith review Pith/arXiv arXiv 2001
[10]

Gradient-based learning applied to document recognition,

ISSN 1558-2256. doi: 10.1109/5.726791. URL https://ieeexplore.ieee.org/document/726791/. John M. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer Science & Business Media, April

work page doi:10.1109/5.726791
[11]

Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024

ISBN 978-0-387-22726-9. Google-Books-ID: 92PgBwAAQBAJ. Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, and Stefano Soatto. On the scalability of diffusion-based text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9400–940...

work page arXiv
[12]

Fleet d, pajdla t, schiele b, tuytelaars t, et al

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Fleet d, pajdla t, schiele b, tuytelaars t, et al. microsoft coco: common objects in context. Computer Vision–ECCV 2014, pages 740–755, 2014a. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr...

work page 2014
[13]

doi: 10.3390/e23111403

ISSN 1099-4300. doi: 10.3390/e23111403. URL https://www.mdpi.com/1099-4300/23/11/1403. Publisher: Multidisciplinary Digital Publishing Institute. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press,

work page doi:10.3390/e23111403
[14]

There Will Be a Scientific Theory of Deep Learning

Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh, Florentin Guth, Arthur Jacot, Mason Kamb, et al. There will be a scientific theory of deep learning. arXiv preprint arXiv:2604.21691,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Emogen: Emotional image content generation with text-to-image diffusion models,

doi: 10.1109/CVPR52733.2024.00458. URL https://ieeexplore.ieee.org/document/10658325/. Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844,

work page doi:10.1109/cvpr52733.2024.00458 2024
[16]

Unbiased look at dataset bias

Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE,

work page 2011
[17]

PoseNet: A convolutional network for real-time 6-dof camera relocalization,

IEEE. ISBN 978-1-7281-4803-8. doi: 10.1109/ICCV.2019.00464. URL https://ieeexplore.ieee.org/document/9010395/. Jiachen Yao, Mayank Goswami, and Chao Chen. A theoretical study of neural network expressive power via manifold topology, October

work page doi:10.1109/iccv 2019
[18]

doi: 10.1145/3446776

ISSN 0001-0782, 1557-7317. doi: 10.1145/3446776. A Notations. A.1 Task and geometry. Manifold. We consider a supervised task, with input space $X \subset \mathbb{R}^{d_X}$ and target space $Y \subset \mathbb{R}^{d_Y}$. We assume that the observations $(x, y)$ belong to a subset $\Omega \subset X \times Y$, which models the structure of the task and its underlying symmetries. Following the manifold hypothesis (Beng...

work page doi:10.1145/3446776 2013
[19]

We denote by 1[E] the indicator function of a set E

the evaluation of its density at a point $y$. We denote by $1[E]$ the indicator function of a set $E$. For a finite set $E$, we denote by $|E|$ its cardinality. If $E$ is measurable, $|E|$ denotes its Lebesgue measure, and $\mathring{E}$ its interior. Asymptotic notation. We denote by $a_n \sim b_n$ the deterministic asymptotic equivalence $a_n / b_n \to 1$. Similarly, we write $X_n \sim_P a_n$ when $X_n / a_n \to 1$ ...

work page 2022
[20]

Given a budget of n points, we are often interested in the best approximation achievable by a discrete set of size n

(see Section E.2). Given a budget of $n$ points, we are often interested in the best approximation achievable by a discrete set of size $n$. This quantity, known as the optimal quantization error or optimal quantization risk (Graf and Luschgy, 2007), is defined as $R_n(P) = \inf_{z \in Y^n} \int_Y \min_{k \in [\![1,n]\!]} \ell(y, z_k)\, p(y)\, dy$ (18). A central tool of our analysis is Zado...

work page 2007
[21]

Despite these similarities, Theorem 8 and Zador’s theorem describe fundamentally different types of asymptotic results

under a common geometric framework. Despite these similarities, Theorem 8 and Zador’s theorem describe fundamentally different types of asymptotic results. Theorem 8 characterizes the quantization error averaged over all i.i.d. datasets of size $n$, whereas Zador’s theorem characterizes the quantization error of a specific (optimal) point configuration. From ...

work page 2020
[22]

Among them, diffusion models can be shown to converge toward the empirical distribution $\frac{1}{|D|} \sum_{y \in D} \delta_y$ when they minimize their training objective (Song and Ermon, 2019)

or normalizing flows (Rezende and Mohamed, 2016). Among them, diffusion models can be shown to converge toward the empirical distribution $\frac{1}{|D|} \sum_{y \in D} \delta_y$ when they minimize their training objective (Song and Ermon, 2019). We will focus on this class of models hereafter. In this case, the empirical distribution corresponds to the prediction space $\Omega_f$ learned...

work page 2016
[23]

Most notably, the leading constant depends on the geometry of $\Omega$ only via a volume term $V_d^*(p)$

This result is remarkable, since it provides an asymptotic equivalent of the representation gap as the dataset size $n$ grows to infinity. Most notably, the leading constant depends on the geometry of $\Omega$ only via a volume term $V_d^*(p)$. C.3 Asymptotic representation gap under the manifold hypothesis. It is possible to extend this result when $\Omega$ is a low-dimensi...

work page 2001
[24]

Informally, the proof of Lemma 1 then relies on the two following approximations: $I(t) = \int_{G(D)} h(z)\, e^{-\|z - y/\alpha_t\|^2/\beta_t}\, dz \approx \int_{G(D)} h(z)\, e^{-\|z - y\|^2/\beta_t}\, dz \approx h(y^*)\, e^{-\|y^* - y\|^2/\beta_t}\, (2\pi\beta_t)^{d/2}$. The first approximation comes from integrating $\|z - y/\alpha_t\|^2 = \|z - y\|^2 + O(\beta_t)$ over the orbit $G(D)$, and the second approximation is an extension of Laplace approximation on measurable s...

work page 2010
[25]

Then, hypothesis (iv) implies that $\gamma_t f(y_t, t) = \partial_t y_t + \gamma_t y_t$ is bounded, which in turn implies $y_t - y_t^* = (1 - \alpha_t) f(y_t, t) \to 0$

By theorem B.3 in Kamb and Ganguli (2025), the score function by the model $f$ can be written $f(y_t, t) = -\frac{1}{1-\alpha_t} \frac{\int_{G(D)} (y - \sqrt{\alpha_t} z)\, N(y \mid \sqrt{\alpha_t} z, (1-\alpha_t) I)\, dz}{\int_{G(D)} N(y \mid \sqrt{\alpha_t} z, (1-\alpha_t) I)\, dz} = \frac{1}{1-\alpha_t} (y_t - y_t^*) + o\!\left(\frac{1}{1-\alpha_t}\right)$ (32), where the second equality is a corollary of Lemma 1 to be justified later. Then, hypothesis (iv) implies that $\gamma_t f(y_t, t) = \partial_t y_t + \gamma_t y_t$ is boun...

work page 2025
[26]

Therefore, we can identify its prediction space $\Omega_f$ with $G(D)$

Proposition 4 establishes that an equivariant diffusion model $f$ generates samples in $G(D)$. Therefore, we can identify its prediction space $\Omega_f$ with $G(D)$. If the symmetry group $G$ enforced by the architecture is aligned with the symmetries of the manifold $\Omega$, then the effective dimension of the learning problem is reduced from $d_\Omega$ to $d_{\Omega/G}$. Proposition 5 (Repr...

work page 1990
[27]

E Link with related work. In this section, we clarify the relations of the concept introduced in this article with several related works. E.1 Generalization error. A natural question is to relate the representation gap $R(\Omega, \Omega_f)$ to the generalization error (Shalev-Shwartz and Ben-David, 2014), commonly used to characterize generalization. We focus on the...

work page 2014
[28]

E.2 Wasserstein distance Wasserstein distance Peyré et al

Generalization error and representation gap are therefore closely related. E.2 Wasserstein distance. Wasserstein distance (Peyré et al., 2019) is typically used to measure neural network generalization (Theis et al., 2015). Interestingly, we can see that the representation gap $R(\Omega, \Omega_f)$ is a particular case of the Wasserstein distance $W(\Omega, \Omega_f)$, where each poi...

work page 2019

[1] [1]

Volume of small balls and sub-Riemannian curvature in 3D contact manifolds

Davide Barilari, Ivan Beschastnyi, and Antonio Lerario. V olume of small balls and sub-riemannian curvature in 3d contact manifolds.arXiv preprint arXiv:1802.10155,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

doi: 10.1109/TPAMI.2013.50

ISSN 1939-3539. doi: 10.1109/TPAMI.2013.50. URL https://ieeexplore. ieee.org/document/6472238/. Gérard Biau and Luc Devroye.Lectures on the nearest neighbor method, volume

work page doi:10.1109/tpami.2013.50 1939

[3] [3]

doi: 10.1051/epjconf/202225809001

ISSN 2100-014X. doi: 10.1051/epjconf/202225809001. Shuxiao Chen, Edgar Dobriban, and Jane H Lee. A group-theoretic framework for data augmentation. Journal of Machine Learning Research,

work page doi:10.1051/epjconf/202225809001

[4] [4]

doi: 10.1090/jams/852

ISSN 0894- 0347, 1088-6834. doi: 10.1090/jams/852. URL https://www.ams.org/jams/2016-29-04/ S0894-0347-2016-00852-4/. Emma Finn, T. Anderson Keller, Manos Theodosis, and Demba E. Ba. Origins of creativity in attention-based diffusion models. InHiLD at ICML

work page doi:10.1090/jams/852 2016

[5] [5]

doi: 10.48550/arXiv.2506. 17324. URLhttp://arxiv.org/abs/2506.17324. S. Gallot, D. Hulin, and J. Lafontaine.Riemannian geometry. Universitext. Springer-Verlag, Berlin ; New York, 2nd ed edition,

work page doi:10.48550/arxiv.2506

[6] [6]

Generating 3d adversarial point clouds, in: 2019 IEEE/CVF Conference on Computer Vision and PatternRecognition(CVPR),pp.9128–9136

ISBN 978-1-7281-3293-8. doi: 10.1109/CVPR.2019. 00411. URLhttps://ieeexplore.ieee.org/document/8953348/. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, June

work page doi:10.1109/cvpr.2019 2019

[7] [7]

Approximation capabilities of multilayer feedforward networks

ISSN 0893-6080. doi: 10.1016/0893-6080(91)90009-T. URL https: //linkinghub.elsevier.com/retrieve/pii/089360809190009T. Mikaela Iacobelli. Asymptotic quantization for probability measures on riemannian manifolds. ESAIM: Control, Optimisation and Calculus of Variations, 22(3):770–785,

work page doi:10.1016/0893-6080(91)90009-t

[8] [8]

Staying on the manifold: Geometry-aware noise injection.arXiv preprint arXiv:2509.20201,

11 Albert Kjøller Jacobsen, Johanna Marie Gegenfurtner, and Georgios Arvanitidis. Staying on the manifold: Geometry-aware noise injection.arXiv preprint arXiv:2509.20201,

work page arXiv

[9] [9]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020a. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario...

work page internal anchor Pith review Pith/arXiv arXiv 2001

[10] [10]

Gradient-based learning applied to document recognition,

ISSN 1558-2256. doi: 10.1109/5.726791. URLhttps://ieeexplore.ieee.org/document/726791/. John M. Lee.Riemannian Manifolds: An Introduction to Curvature. Springer Science & Business Media, April

[11]

Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024

ISBN 978-0-387-22726-9. Google-Books-ID: 92PgBwAAQBAJ. Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, and Stefano Soatto. On the scalability of diffusion-based text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9400–940...

[12]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision–ECCV 2014, pages 740–755, 2014a. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr...

[13]

doi: 10.3390/e23111403

ISSN 1099-4300. doi: 10.3390/e23111403. URL https://www.mdpi.com/1099-4300/23/11/1403. Publisher: Multidisciplinary Digital Publishing Institute. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press,

[14]

There Will Be a Scientific Theory of Deep Learning

Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh, Florentin Guth, Arthur Jacot, Mason Kamb, et al. There will be a scientific theory of deep learning. arXiv preprint arXiv:2604.21691,

[15]

Emogen: Emotional image content generation with text-to-image diffusion models,

doi: 10.1109/CVPR52733.2024.00458. URL https://ieeexplore.ieee.org/document/10658325/. Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844,

[16]

Unbiased look at dataset bias

Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE,

[17]

PoseNet: A convolutional network for real-time 6-dof camera relocalization,

IEEE. ISBN 978-1-7281-4803-8. doi: 10.1109/ICCV.2019.00464. URL https://ieeexplore.ieee.org/document/9010395/. Jiachen Yao, Mayank Goswami, and Chao Chen. A theoretical study of neural network expressive power via manifold topology, October

[18]

doi: 10.1145/3446776

ISSN 0001-0782, 1557-7317. doi: 10.1145/3446776.

A Notations

A.1 Task and geometry

Manifold. We consider a supervised task with input space $X \subset \mathbb{R}^{d_X}$ and target space $Y \subset \mathbb{R}^{d_Y}$. We assume that the observations $(x, y)$ belong to a subset $\Omega \subset X \times Y$, which models the structure of the task and its underlying symmetries. Following the manifold hypothesis (Beng...

[19]

We denote by 1[E] the indicator function of a set E

the evaluation of its density at a point $y$. We denote by $\mathbf{1}[E]$ the indicator function of a set $E$. For a finite set $E$, we denote by $|E|$ its cardinality. If $E$ is measurable, $|E|$ denotes its Lebesgue measure and $\mathring{E}$ its interior. Asymptotic notation. We denote by $a_n \sim b_n$ the deterministic asymptotic equivalence $a_n / b_n \to 1$. Similarly, we write $X_n \sim_P a_n$ when $X_n / a_n \to 1$ ...

[20]

Given a budget of n points, we are often interested in the best approximation achievable by a discrete set of size n

(see Section E.2). Given a budget of $n$ points, we are often interested in the best approximation achievable by a discrete set of size $n$. This quantity, known as the optimal quantization error or optimal quantization risk (Graf and Luschgy, 2007), is defined as

$$R_n(P) = \inf_{z \in Y^n} \int_Y \min_{k \in [\![1,n]\!]} \ell(y, z_k)\, p(y)\, dy. \tag{18}$$

A central tool of our analysis is Zado...
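As a hedged illustration of the objective inside the infimum of (18), the sketch below evaluates the quantization risk of one fixed codebook. It does not compute the infimum itself. The uniform density on $[0, 1]$, the squared-error loss, and the names `quantization_risk` and `squared_error` are assumptions made for the example; the text keeps $\ell$ and $p$ generic. In this one-dimensional uniform case, the cell-midpoint codebook is known to be optimal, with risk $1/(12 n^2)$, which gives a closed form to compare against.

```python
def quantization_risk(codebook, loss, n_grid=4000):
    """Midpoint-rule approximation of the integral of min_k loss(y, z_k) p(y) dy.

    Assumes the uniform density p = 1 on [0, 1] (illustrative choice only).
    """
    total = 0.0
    for i in range(n_grid):
        y = (i + 0.5) / n_grid  # midpoint of the i-th grid cell
        total += min(loss(y, z) for z in codebook)
    return total / n_grid


def squared_error(y, z):
    # Hypothetical choice of loss; the text leaves the loss generic.
    return (y - z) ** 2


n = 4
# Centers of n equal cells: the optimal codebook for this uniform 1D case.
midpoint_codebook = [(2 * k + 1) / (2 * n) for k in range(n)]
risk = quantization_risk(midpoint_codebook, squared_error)
closed_form = 1.0 / (12 * n ** 2)
print(risk, closed_form)
```

Any other codebook of the same size, for instance one clustered near the left end of the interval, should yield a larger value from `quantization_risk`.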

[21]

Despite these similarities, Theorem 8 and Zador’s theorem describe fundamentally different types of asymptotic results

under a common geometric framework. Despite these similarities, Theorem 8 and Zador’s theorem describe fundamentally different types of asymptotic results. Theorem 8 characterizes the quantization error averaged over all i.i.d. datasets of size $n$, whereas Zador’s theorem characterizes the quantization error of a specific (optimal) point configuration. From ...
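The distinction between an average over i.i.d. datasets and a single optimal configuration can be sketched numerically. The example below is a toy setting and not the paper's experiment: the uniform density on $[0, 1]$, the squared-error loss, the seed, and the helper names `risk_on_grid` and `average_random_risk` are all assumptions made for illustration.

```python
import random


def risk_on_grid(points, n_grid=400):
    """Midpoint-rule estimate of E[min_k (y - z_k)^2] for y uniform on [0, 1]."""
    total = 0.0
    for i in range(n_grid):
        y = (i + 0.5) / n_grid
        total += min((y - z) ** 2 for z in points)
    return total / n_grid


def average_random_risk(n, n_datasets=200, seed=0):
    """Average the risk over i.i.d. uniform datasets of size n."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    risks = []
    for _ in range(n_datasets):
        dataset = [rng.random() for _ in range(n)]
        risks.append(risk_on_grid(dataset))
    return sum(risks) / len(risks)


n = 20
random_risk = average_random_risk(n)
# Cell midpoints: the optimal configuration in this uniform 1D toy case.
optimal_risk = risk_on_grid([(2 * k + 1) / (2 * n) for k in range(n)])
print(random_risk, optimal_risk)
```

In this toy case the dataset-averaged risk should come out larger than the risk of the optimal configuration, since random points cover the interval less evenly than the evenly spaced midpoints.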

[22]

Among them, diffusion models can be shown to converge toward the empirical distribution $\frac{1}{|D|} \sum_{y \in D} \delta_y$ when they minimize their training objective (Song and Ermon, 2019)

or normalizing flows (Rezende and Mohamed, 2016). Among them, diffusion models can be shown to converge toward the empirical distribution $\frac{1}{|D|} \sum_{y \in D} \delta_y$ when they minimize their training objective (Song and Ermon, 2019). We will focus on this class of models hereafter. In this case, the empirical distribution corresponds to the prediction space $\Omega_f$ learned...

[23]

Most notably, the leading constant depends on the geometry of $\Omega$ only via a volume term $V^*_d(p)$

This result is remarkable, since it provides an asymptotic equivalent of the representation gap as the dataset size $n$ grows to infinity. Most notably, the leading constant depends on the geometry of $\Omega$ only via a volume term $V^*_d(p)$.

C.3 Asymptotic representation gap under the manifold hypothesis

It is possible to extend this result when $\Omega$ is a low-dimensi...

[24]

Informally, the proof of Lemma 1 then relies on the following two approximations:

$$I(t) = \int_{G(D)} h(z)\, e^{-\|z - y/\alpha_t\|^2/\beta_t}\, dz \approx \int_{G(D)} h(z)\, e^{-\|z - y\|^2/\beta_t}\, dz \approx h(y^*)\, e^{-\|y^* - y\|^2/\beta_t}\, (2\pi\beta_t)^{d/2}.$$

The first approximation comes from integrating $\|z - y/\alpha_t\|^2 = \|z - y\|^2 + O(\beta_t)$ over the orbit $G(D)$, and the second approximation is an extension of Laplace approximation on measurable s...

[25]

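Then, hypothesis (iv) implies that $\gamma_t f(y_t, t) = \partial_t y_t + \gamma_t y_t$ is bounded, which in turn implies $y_t - y_t^* = (1-\alpha_t) f(y_t, t) \to 0$

By Theorem B.3 in Kamb and Ganguli (2025), the score function of the model $f$ can be written

$$f(y_t, t) = -\frac{1}{1-\alpha_t}\, \frac{\int_{G(D)} (y - \sqrt{\alpha_t}\, z)\, \mathcal{N}(y \mid \sqrt{\alpha_t}\, z, (1-\alpha_t) I)\, dz}{\int_{G(D)} \mathcal{N}(y \mid \sqrt{\alpha_t}\, z, (1-\alpha_t) I)\, dz} = \frac{1}{1-\alpha_t}(y_t - y_t^*) + o\!\left(\frac{1}{1-\alpha_t}\right), \tag{32}$$

where the second equality is a corollary of Lemma 1 to be justified later. Then, hypothesis (iv) implies that $\gamma_t f(y_t, t) = \partial_t y_t + \gamma_t y_t$ is boun...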
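To make the ratio of Gaussian-weighted integrals in (32) concrete, here is a minimal sketch in which the integral over the orbit $G(D)$ is replaced by a sum over a finite list of points. That replacement amounts to assuming a trivial symmetry group and is a simplification of the paper's setting. The function name `empirical_score` and the toy data are hypothetical, and the code follows the first expression in (32), including its $-1/(1-\alpha_t)$ prefactor.

```python
import math


def empirical_score(y, data, alpha_t):
    """Finite-sum analogue of the ratio of Gaussian-weighted integrals.

    The integral over G(D) is replaced by a sum over `data` (an assumption).
    """
    var = 1.0 - alpha_t
    scale = math.sqrt(alpha_t)
    # Log-weights of N(y | sqrt(alpha_t) z, (1 - alpha_t) I), up to a shared constant.
    log_w = [
        -sum((yi - scale * zi) ** 2 for yi, zi in zip(y, z)) / (2.0 * var)
        for z in data
    ]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    # Weighted average of (y - sqrt(alpha_t) z), coordinate by coordinate.
    mean_residual = [
        sum(wk * (y[i] - scale * z[i]) for wk, z in zip(w, data)) / total
        for i in range(len(y))
    ]
    # Apply the -1 / (1 - alpha_t) prefactor.
    return [-r / var for r in mean_residual]


# With a single data point the weights cancel and the expression
# reduces to -(y - sqrt(alpha_t) z) / (1 - alpha_t).
single = empirical_score([0.5, 0.2], [[1.0, 1.0]], alpha_t=0.64)

# Halfway between the two scaled data points the weights are equal
# and the residuals cancel.
balanced = empirical_score([0.4, 0.4], [[0.0, 0.0], [1.0, 1.0]], alpha_t=0.64)
print(single, balanced)
```

The Gaussian weights act as a softmax over the data points, so as $1 - \alpha_t$ shrinks the weighted average is dominated by the data point whose scaled copy is closest to $y$.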

[26]

Therefore, we can identify its prediction space $\Omega_f$ with $G(D)$

Proposition 4 establishes that an equivariant diffusion model $f$ generates samples in $G(D)$. Therefore, we can identify its prediction space $\Omega_f$ with $G(D)$. If the symmetry group $G$ enforced by the architecture is aligned with the symmetries of the manifold $\Omega$, then the effective dimension of the learning problem is reduced from $d_\Omega$ to $d_{\Omega/G}$.

Proposition 5 (Repr...

[27]

E Link with related work

In this section, we clarify the relations of the concept introduced in this article with several related works.

E.1 Generalization error

A natural question is to relate the representation gap $R(\Omega, \Omega_f)$ to the generalization error (Shalev-Shwartz and Ben-David, 2014), commonly used to characterize generalization. We focus on the...

[28]

E.2 Wasserstein distance

Generalization error and representation gap are therefore closely related.

E.2 Wasserstein distance

The Wasserstein distance (Peyré et al., 2019) is typically used to measure neural network generalization (Theis et al., 2015). Interestingly, we can see that the representation gap $R(\Omega, \Omega_f)$ is a particular case of the Wasserstein distance $W(\Omega, \Omega_f)$, where each poi...
