Pointwise Generalization in Deep Neural Networks

Shaojie Li; Yunbei Xu

arxiv: 2605.18598 · v1 · pith:FIB6NIWInew · submitted 2026-05-18 · 💻 cs.LG · cond-mat.stat-mech· math.FA· math.PR· math.ST· stat.TH

Pointwise Generalization in Deep Neural Networks

Shaojie Li , Yunbei Xu This is my paper

Pith reviewed 2026-05-20 12:31 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.stat-mechmath.FAmath.PRmath.STstat.TH

keywords deep neural networksgeneralization boundsRiemannian dimensionfeature representationsnonlinear feature learningpointwise complexityrepresentation learningimplicit bias

0 comments

The pith

Deep neural networks generalize because each trained model has a pointwise Riemannian dimension drawn from the eigenvalues of its learned feature representations across layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a pointwise generalization theory for fully connected deep networks that directly characterizes any trained hypothesis by a Riemannian dimension computed from the spectrum of its internal feature representations. This moves past fixed measures such as parameter count or norm products and instead ties complexity to the actual features the network has learned. A reader would care because the resulting bounds are hypothesis-dependent and orders of magnitude tighter than earlier guarantees, both on paper and in numerical checks. The work also supplies analytic reasons why deep networks remain tractable once the nonlinear regime is properly modeled.

Core claim

For each trained model the hypothesis is characterized by a pointwise Riemannian Dimension obtained from the eigenvalues of the learned feature representations across layers. This supplies a principled way to derive hypothesis-dependent, representation-aware generalization bounds that improve systematically over size-based, norm-product, and infinite-width linearization approaches. The same dimension reveals structural properties that make deep networks tractable, shows clear feature compression, shrinks with over-parameterization, and registers the implicit bias of the optimizer.

What carries the argument

The pointwise Riemannian Dimension, assembled from the eigenvalues of learned feature representations at each layer, which serves as a hypothesis-specific complexity measure in the nonlinear regime.

If this is right

Generalization guarantees become dependent on the actual learned features rather than on architecture size alone.
The bounds are orders of magnitude tighter than those based on products of norms or infinite-width approximations.
The dimension decreases with greater over-parameterization while still controlling the gap.
Optimizer implicit bias becomes visible through changes in the feature spectrum.
Deep networks are shown to be mathematically tractable once pointwise, spectrum-aware complexity is used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same eigenvalue construction could be applied to convolutional or attention-based layers if their feature maps are treated analogously.
Training algorithms might be modified to explicitly penalize growth in the leading eigenvalues of intermediate representations.
The dimension offers a post-training diagnostic that could replace or supplement validation-set estimates of generalization.

Load-bearing premise

The eigenvalues of the learned feature representations across layers can be assembled into a pointwise Riemannian Dimension that faithfully captures hypothesis complexity and supports valid generalization bounds.

What would settle it

A benchmark experiment in which the computed pointwise Riemannian Dimension fails to produce generalization bounds tighter than existing norm- or size-based bounds on standard image or language tasks.

Figures

Figures reproduced from arXiv: 2605.18598 by Shaojie Li, Yunbei Xu.

**Figure 2.** Figure 2: Effective Rank evolutions of FCNs on MNIST (left) and ResNets on CIFAR-10 (right) [PITH_FULL_IMAGE:figures/full_fig_p027_2.png] view at source ↗

**Figure 3.** Figure 3: Riemannian Dimension evolutions of FCNs on MNIST (left) and ResNets on CIFAR-10 [PITH_FULL_IMAGE:figures/full_fig_p028_3.png] view at source ↗

read the original abstract

We address the fundamental question of why deep neural networks generalize by establishing a pointwise generalization theory for fully connected networks. This framework resolves long-standing barriers to characterizing the rich nonlinear feature-learning regime and builds a new statistical foundation for representation learning. For each trained model, we characterize the hypothesis via a pointwise Riemannian Dimension, derived from the eigenvalues of the learned feature representations across layers. This establishes a principled framework for deriving hypothesis-dependent, representation-aware generalization bounds. These bounds offer a systematic upgrade over approaches based on model size, products of norms, and infinite-width linearizations, yielding guarantees that are orders of magnitude tighter in both theory and experiment. Analytically, we identify the structural properties and mathematical principles that explain the tractability of deep networks. Empirically, the pointwise Riemannian Dimension exhibits substantial feature compression, decreases with increased over-parameterization, and captures the implicit bias of optimizers. Taken together, our results indicate that deep networks are mathematically tractable in practical regimes and that their generalization is sharply explained by pointwise, feature-spectrum-aware complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a pointwise Riemannian Dimension from eigenvalue spectra of layer activations to produce tighter hypothesis-dependent generalization bounds that avoid the usual circularity traps.

read the letter

The main thing to know is that this work builds a complexity measure directly from the learned representations in a trained network. They extract effective dimensions from the eigenvalues of the Gram matrices of activations across layers, call the result the pointwise Riemannian Dimension, and substitute it into a covering-number or Rademacher bound that already uses the data-dependent feature map. This yields bounds that are substantially tighter than those based on parameter count, norm products, or infinite-width linearizations, and the construction does not collapse into post-hoc fitting on the test data.

Referee Report

0 major / 3 minor

Summary. The paper establishes a pointwise generalization theory for fully connected deep neural networks. It defines a pointwise Riemannian Dimension for each trained model from the eigenvalues of the Gram matrices of layer activations, then substitutes this quantity into standard covering-number or Rademacher arguments to obtain hypothesis-dependent, representation-aware generalization bounds. The work claims these bounds are orders of magnitude tighter than those based on model size, norm products, or infinite-width linearizations, while also identifying structural properties that explain tractability and presenting empirical results on feature compression, dependence on over-parameterization, and optimizer implicit bias.

Significance. If the central construction holds, the framework offers a concrete advance in characterizing generalization for the nonlinear feature-learning regime by replacing global complexity measures with a pointwise, spectrum-aware quantity. The absence of circularity in the derivation (the dimension is assembled directly from the learned activations and inserted into an existing data-dependent bound) and the consistency of the reported empirical plots with the stated claims on dimension scaling are notable strengths.

minor comments (3)

The precise aggregation rule (product versus sum) used to combine per-layer effective dimensions into the final pointwise Riemannian Dimension should be stated explicitly in the main text, with a short justification for the chosen form.
Figure captions for the dimension-versus-width and optimizer-bias plots should include the number of independent runs and any error bars to allow readers to assess variability.
A brief comparison table placing the new bounds against the best previously published numerical values on the same architectures and datasets would strengthen the 'orders of magnitude tighter' claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive assessment of our manuscript, including the accurate summary of our pointwise Riemannian Dimension construction and its use in representation-aware bounds. The recommendation for minor revision is appreciated. No specific major comments were raised in the report, so we have no point-by-point rebuttals to offer at this time. We remain available to address any additional minor suggestions or clarifications that may arise during the revision process.

Circularity Check

0 steps flagged

No significant circularity detected in derivation

full rationale

The central construction extracts a pointwise Riemannian Dimension as a product or sum of effective dimensions computed from the eigenvalue spectra of Gram matrices of the trained network's layer activations. This quantity is then inserted into a standard covering-number or Rademacher-complexity bound that already incorporates the data-dependent feature map. No equation reduces the final bound to a fitted parameter on the same data, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in; the derivation remains self-contained against external benchmarks and the reported empirical trends follow directly from the stated definitions without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review; ledger populated from stated contributions. The central new object is the Riemannian Dimension itself; no explicit free parameters or external axioms are named.

axioms (1)

domain assumption Eigenvalues of learned feature representations across layers can be combined into a dimension that controls generalization error in the nonlinear regime.
This premise is invoked to justify the entire bound framework but is not derived in the abstract.

invented entities (1)

pointwise Riemannian Dimension no independent evidence
purpose: Characterize each trained hypothesis for representation-aware generalization bounds
Newly introduced quantity derived from feature eigenvalues; no independent evidence outside the paper is mentioned.

pith-pipeline@v0.9.0 · 5720 in / 1377 out tokens · 75630 ms · 2026-05-20T12:31:20.596073+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

pointwise Riemannian Dimension, derived from the eigenvalues of the learned feature representations across layers... effective dimension deff(G(W), R, ε) := ½ ∑ log(8 R² λk / n ε²)
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ellipsoidal covering of the Grassmannian... Lemma 3

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

191 extracted references · 191 canonical work pages · 10 internal anchors

[1]

Sharp nonasymptotic bounds on the norm of random matrices with independent entries , author=

work page
[2]

Transactions of the American Mathematical Society , volume=

On the spectral norm of Gaussian random matrices , author=. Transactions of the American Mathematical Society , volume=

work page
[3]

International Conference on Machine Learning , pages=

Sharp minima can generalize for deep nets , author=. International Conference on Machine Learning , pages=. 2017 , organization=

work page 2017
[4]

Neural computation , volume=

Flat minima , author=. Neural computation , volume=

work page
[5]

1999 , publisher=

The nature of statistical learning theory , author=. 1999 , publisher=

work page 1999
[6]

International Conference on Learning Representations , year=

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , author=. International Conference on Learning Representations , year=

work page
[7]

Mathematics of Operations Research , volume=

Towards optimal problem dependent generalization error bounds in statistical learning theory , author=. Mathematics of Operations Research , volume=. 2025 , publisher=

work page 2025
[8]

International Conference on Learning Representations , year=

Sharpness-aware Minimization for Efficiently Improving Generalization , author=. International Conference on Learning Representations , year=

work page
[9]

Proceedings of the eleventh annual conference on Computational learning theory , pages=

Some pac-bayesian theorems , author=. Proceedings of the eleventh annual conference on Computational learning theory , pages=

work page
[10]

Proceedings of the twelfth annual conference on Computational learning theory , pages=

PAC-Bayesian model averaging , author=. Proceedings of the twelfth annual conference on Computational learning theory , pages=

work page
[11]

Foundations and Trends

User-friendly introduction to PAC-Bayes bounds , author=. Foundations and Trends. 2024 , publisher=

work page 2024
[12]

A Note on Pointwise Dimensions

A note on pointwise dimensions , author=. arXiv preprint arXiv:1612.05849 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Advances in Neural Information Processing Systems , volume=

PAC-Bayes-empirical-Bernstein inequality , author=. Advances in Neural Information Processing Systems , volume=

work page
[14]

Conference on Learning Theory , pages=

Concentration of non-isotropic random tensors with applications to learning and empirical risk minimization , author=. Conference on Learning Theory , pages=. 2021 , organization=

work page 2021
[15]

Advances in Neural Information Processing Systems , volume=

Representational strengths and limitations of transformers , author=. Advances in Neural Information Processing Systems , volume=

work page
[16]

A Note on the PAC Bayesian Theorem

A note on the PAC Bayesian theorem , author=. arXiv preprint cs/0411099 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

preprint , volume=

A PAC-Bayesian approach to adaptive classification , author=. preprint , volume=

work page
[18]

International Conference on Algorithmic Learning Theory , pages=

A strongly quasiconvex PAC-Bayesian bound , author=. International Conference on Algorithmic Learning Theory , pages=. 2017 , organization=

work page 2017
[19]

Probabilistic methods for algorithmic discrete mathematics , pages=

Concentration , author=. Probabilistic methods for algorithmic discrete mathematics , pages=. 1998 , publisher=

work page 1998
[20]

2013 , publisher=

Convex bodies: the Brunn--Minkowski theory , author=. 2013 , publisher=

work page 2013
[21]

Advances in neural information processing systems , volume=

Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=

work page
[22]

International Conference on Learning Representations , year=

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks , author=. International Conference on Learning Representations , year=

work page
[23]

nature , volume=

Deep learning , author=. nature , volume=. 2015 , publisher=

work page 2015
[24]

IEEE transactions on pattern analysis and machine intelligence , volume=

Representation learning: A review and new perspectives , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2013 , publisher=

work page 2013
[25]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[26]

Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13 , pages=

Visualizing and understanding convolutional networks , author=. Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13 , pages=. 2014 , organization=

work page 2014
[27]

Journal of Machine Learning Research , volume=

Rademacher and gaussian complexities: Risk bounds and structural results , author=. Journal of Machine Learning Research , volume=

work page
[28]

Information and Inference: A Journal of the IMA , volume=

Size-independent sample complexity of neural networks , author=. Information and Inference: A Journal of the IMA , volume=. 2020 , publisher=

work page 2020
[29]

International conference on machine learning , pages=

Stronger generalization bounds for deep nets via a compression approach , author=. International conference on machine learning , pages=. 2018 , organization=

work page 2018
[30]

Advances in neural information processing systems , volume=

Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in neural information processing systems , volume=

work page
[31]

Advances in neural information processing systems , volume=

On exact computation with an infinitely wide neural net , author=. Advances in neural information processing systems , volume=

work page
[32]

Advances in neural information processing systems , volume=

PAC-Bayesian generic chaining , author=. Advances in neural information processing systems , volume=

work page
[33]

2005 , publisher=

The generic chaining: upper and lower bounds of stochastic processes , author=. 2005 , publisher=

work page 2005
[34]

2018 , publisher=

High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=

work page 2018
[35]

, author=

Combining PAC-Bayesian and Generic Chaining Bounds. , author=. Journal of Machine Learning Research , volume=

work page
[36]

International Conference on Machine Learning , pages=

Bayesian design principles for frequentist sequential learning , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[37]

Advances in Neural Information Processing Systems , volume=

Assouad, Fano, and Le Cam with Interaction: A Unifying Lower Bound Framework and Characterization for Bandit Learnability , author=. Advances in Neural Information Processing Systems , volume=

work page
[38]

Conference on Learning Theory , pages=

Majorizing measures, sequential complexities, and online learning , author=. Conference on Learning Theory , pages=. 2021 , organization=

work page 2021
[39]

Proceedings of the 53rd annual ACM SIGACT symposium on theory of computing , pages=

Adversarial laws of large numbers and optimal regret in online classification , author=. Proceedings of the 53rd annual ACM SIGACT symposium on theory of computing , pages=

work page
[40]

Advances in neural information processing systems , volume=

Efficient and accurate estimation of lipschitz constants for deep neural networks , author=. Advances in neural information processing systems , volume=

work page
[41]

Advances in Neural Information Processing Systems , volume=

PAC-Bayes compression bounds so tight that they can explain generalization , author=. Advances in Neural Information Processing Systems , volume=

work page
[42]

1997 , publisher=

Techniques in fractal geometry , author=. 1997 , publisher=

work page 1997
[43]

Opening the Black Box of Deep Neural Networks via Information

Opening the black box of deep neural networks via information , author=. arXiv preprint arXiv:1703.00810 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

International conference on machine learning , pages=

On the spectral bias of neural networks , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019
[45]

Metric Entropy of Homogeneous Spaces

Metric entropy of homogeneous spaces , author=. arXiv preprint math/9701213 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[47]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

work page 2019
[48]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[49]

2008 , publisher=

Riemannian geometry and geometric analysis , author=. 2008 , publisher=

work page 2008
[50]

arXiv preprint arXiv:2004.13135 , year=

Local lipschitz bounds of deep neural networks , author=. arXiv preprint arXiv:2004.13135 , year=

work page arXiv 2004
[51]

IEEE transactions on information theory , volume=

On coverings of ellipsoids in Euclidean spaces , author=. IEEE transactions on information theory , volume=. 2004 , publisher=

work page 2004
[52]

The American mathematical monthly , volume=

Cauchy's interlace theorem for eigenvalues of Hermitian matrices , author=. The American mathematical monthly , volume=. 2004 , publisher=

work page 2004
[53]

Proceedings of the National Academy of Sciences , volume=

Prevalence of neural collapse during the terminal phase of deep learning training , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

work page 2020
[54]

Communications of the ACM , volume=

On the implicit bias in deep-learning algorithms , author=. Communications of the ACM , volume=. 2023 , publisher=

work page 2023
[55]

arXiv preprint arXiv:2010.02501 , year=

A unifying view on implicit bias in training linear neural networks , author=. arXiv preprint arXiv:2010.02501 , year=

work page arXiv 2010
[56]

International Conference on Machine Learning , pages=

On the implicit bias of initialization shape: Beyond infinitesimal mirror descent , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[57]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

, author=

Adaptive subgradient methods for online learning and stochastic optimization. , author=. Journal of machine learning research , volume=

work page
[59]

Cited on , volume=

Neural networks for machine learning lecture 6a overview of mini-batch gradient descent , author=. Cited on , volume=

work page
[60]

Acta Mathematica , volume=

Regularity of gaussian processes , author=. Acta Mathematica , volume=. 1987 , publisher=

work page 1987
[61]

Ecole d’Et

Regularite des trajectoires des fonctions aleatoires gaussiennes , author=. Ecole d’Et. 1974 , publisher=

work page 1974
[62]

Lecture Notes (Princeton University) , volume=

Probability in high dimension , author=. Lecture Notes (Princeton University) , volume=

work page
[63]

Alessandro Rinaldo and Enxu Yan , title =

work page
[64]

2023 IEEE International Symposium on Information Theory (ISIT) , pages=

Majorizing Measures, Codes, and Information , author=. 2023 IEEE International Symposium on Information Theory (ISIT) , pages=. 2023 , organization=

work page 2023
[65]

Minkowski inequality , howpublished =

work page
[66]

Normed vector space , howpublished =

work page
[67]

Min–Max Theorem , author =

work page
[68]

Envelope theorem , howpublished =

work page
[69]

Proceedings of the IEEE , volume=

Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

work page 1998
[70]

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms , author=. arXiv preprint arXiv:1708.07747 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[71]

Advances in neural information processing systems , volume=

Exploring generalization in deep learning , author=. Advances in neural information processing systems , volume=

work page
[72]

NIPS Workshop on Deep Learning and Unsupervised Feature Learning , year=

Reading digits in natural images with unsupervised feature learning , author=. NIPS Workshop on Deep Learning and Unsupervised Feature Learning , year=

work page
[73]

Learning multiple layers of features from tiny images , author=

work page
[74]

Foundations and Trends

Sketching as a tool for numerical linear algebra , author=. Foundations and Trends. 2014 , publisher=

work page 2014
[75]

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data , author=. arXiv preprint arXiv:1703.11008 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[76]

SIAM journal on Matrix Analysis and Applications , volume=

The geometry of algorithms with orthogonality constraints , author=. SIAM journal on Matrix Analysis and Applications , volume=. 1998 , publisher=

work page 1998
[77]

Advances In Neural Information Processing Systems , volume=

Generalization error bounds for collaborative prediction with low-rank matrices , author=. Advances In Neural Information Processing Systems , volume=

work page
[78]

2013 , publisher=

Matrix computations , author=. 2013 , publisher=

work page 2013
[79]

2013 , publisher=

Matrix analysis , author=. 2013 , publisher=

work page 2013
[80]

Convex Geometric Analysis , volume=

Metric entropy of the Grassmann manifold , author=. Convex Geometric Analysis , volume=. 1998 , publisher=

work page 1998

Showing first 80 references.

[1] [1]

Sharp nonasymptotic bounds on the norm of random matrices with independent entries , author=

work page

[2] [2]

Transactions of the American Mathematical Society , volume=

On the spectral norm of Gaussian random matrices , author=. Transactions of the American Mathematical Society , volume=

work page

[3] [3]

International Conference on Machine Learning , pages=

Sharp minima can generalize for deep nets , author=. International Conference on Machine Learning , pages=. 2017 , organization=

work page 2017

[4] [4]

Neural computation , volume=

Flat minima , author=. Neural computation , volume=

work page

[5] [5]

1999 , publisher=

The nature of statistical learning theory , author=. 1999 , publisher=

work page 1999

[6] [6]

International Conference on Learning Representations , year=

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , author=. International Conference on Learning Representations , year=

work page

[7] [7]

Mathematics of Operations Research , volume=

Towards optimal problem dependent generalization error bounds in statistical learning theory , author=. Mathematics of Operations Research , volume=. 2025 , publisher=

work page 2025

[8] [8]

International Conference on Learning Representations , year=

Sharpness-aware Minimization for Efficiently Improving Generalization , author=. International Conference on Learning Representations , year=

work page

[9] [9]

Proceedings of the eleventh annual conference on Computational learning theory , pages=

Some pac-bayesian theorems , author=. Proceedings of the eleventh annual conference on Computational learning theory , pages=

work page

[10] [10]

Proceedings of the twelfth annual conference on Computational learning theory , pages=

PAC-Bayesian model averaging , author=. Proceedings of the twelfth annual conference on Computational learning theory , pages=

work page

[11] [11]

Foundations and Trends

User-friendly introduction to PAC-Bayes bounds , author=. Foundations and Trends. 2024 , publisher=

work page 2024

[12] [12]

A Note on Pointwise Dimensions

A note on pointwise dimensions , author=. arXiv preprint arXiv:1612.05849 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Advances in Neural Information Processing Systems , volume=

PAC-Bayes-empirical-Bernstein inequality , author=. Advances in Neural Information Processing Systems , volume=

work page

[14] [14]

Conference on Learning Theory , pages=

Concentration of non-isotropic random tensors with applications to learning and empirical risk minimization , author=. Conference on Learning Theory , pages=. 2021 , organization=

work page 2021

[15] [15]

Advances in Neural Information Processing Systems , volume=

Representational strengths and limitations of transformers , author=. Advances in Neural Information Processing Systems , volume=

work page

[16] [16]

A Note on the PAC Bayesian Theorem

A note on the PAC Bayesian theorem , author=. arXiv preprint cs/0411099 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

preprint , volume=

A PAC-Bayesian approach to adaptive classification , author=. preprint , volume=

work page

[18] [18]

International Conference on Algorithmic Learning Theory , pages=

A strongly quasiconvex PAC-Bayesian bound , author=. International Conference on Algorithmic Learning Theory , pages=. 2017 , organization=

work page 2017

[19] [19]

Probabilistic methods for algorithmic discrete mathematics , pages=

Concentration , author=. Probabilistic methods for algorithmic discrete mathematics , pages=. 1998 , publisher=

work page 1998

[20] [20]

2013 , publisher=

Convex bodies: the Brunn--Minkowski theory , author=. 2013 , publisher=

work page 2013

[21] [21]

Advances in neural information processing systems , volume=

Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=

work page

[22] [22]

International Conference on Learning Representations , year=

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks , author=. International Conference on Learning Representations , year=

work page

[23] [23]

nature , volume=

Deep learning , author=. nature , volume=. 2015 , publisher=

work page 2015

[24] [24]

IEEE transactions on pattern analysis and machine intelligence , volume=

Representation learning: A review and new perspectives , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2013 , publisher=

work page 2013

[25] [25]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016

[26] [26]

Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13 , pages=

Visualizing and understanding convolutional networks , author=. Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13 , pages=. 2014 , organization=

work page 2014

[27] [27]

Journal of Machine Learning Research , volume=

Rademacher and gaussian complexities: Risk bounds and structural results , author=. Journal of Machine Learning Research , volume=

work page

[28] [28]

Information and Inference: A Journal of the IMA , volume=

Size-independent sample complexity of neural networks , author=. Information and Inference: A Journal of the IMA , volume=. 2020 , publisher=

work page 2020

[29] [29]

International conference on machine learning , pages=

Stronger generalization bounds for deep nets via a compression approach , author=. International conference on machine learning , pages=. 2018 , organization=

work page 2018

[30] [30]

Advances in neural information processing systems , volume=

Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in neural information processing systems , volume=

work page

[31] [31]

Advances in neural information processing systems , volume=

On exact computation with an infinitely wide neural net , author=. Advances in neural information processing systems , volume=

work page

[32] [32]

Advances in neural information processing systems , volume=

PAC-Bayesian generic chaining , author=. Advances in neural information processing systems , volume=

work page

[33] [33]

2005 , publisher=

The generic chaining: upper and lower bounds of stochastic processes , author=. 2005 , publisher=

work page 2005

[34] [34]

2018 , publisher=

High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=

work page 2018

[35] [35]

, author=

Combining PAC-Bayesian and Generic Chaining Bounds. , author=. Journal of Machine Learning Research , volume=

work page

[36] [36]

International Conference on Machine Learning , pages=

Bayesian design principles for frequentist sequential learning , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023

[37] [37]

Advances in Neural Information Processing Systems , volume=

Assouad, Fano, and Le Cam with Interaction: A Unifying Lower Bound Framework and Characterization for Bandit Learnability , author=. Advances in Neural Information Processing Systems , volume=

work page

[38] [38]

Conference on Learning Theory , pages=

Majorizing measures, sequential complexities, and online learning , author=. Conference on Learning Theory , pages=. 2021 , organization=

work page 2021

[39] [39]

Proceedings of the 53rd annual ACM SIGACT symposium on theory of computing , pages=

Adversarial laws of large numbers and optimal regret in online classification , author=. Proceedings of the 53rd annual ACM SIGACT symposium on theory of computing , pages=

work page

[40] [40]

Advances in neural information processing systems , volume=

Efficient and accurate estimation of lipschitz constants for deep neural networks , author=. Advances in neural information processing systems , volume=

work page

[41] [41]

Advances in Neural Information Processing Systems , volume=

PAC-Bayes compression bounds so tight that they can explain generalization , author=. Advances in Neural Information Processing Systems , volume=

work page

[42] [42]

1997 , publisher=

Techniques in fractal geometry , author=. 1997 , publisher=

work page 1997

[43] [43]

Opening the Black Box of Deep Neural Networks via Information

Opening the black box of deep neural networks via information , author=. arXiv preprint arXiv:1703.00810 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

International conference on machine learning , pages=

On the spectral bias of neural networks , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019

[45] [45]

Metric Entropy of Homogeneous Spaces

Metric entropy of homogeneous spaces , author=. arXiv preprint math/9701213 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page

[47] [47]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

work page 2019

[48] [48]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page

[49] [49]

2008 , publisher=

Riemannian geometry and geometric analysis , author=. 2008 , publisher=

work page 2008

[50] [50]

arXiv preprint arXiv:2004.13135 , year=

Local lipschitz bounds of deep neural networks , author=. arXiv preprint arXiv:2004.13135 , year=

work page arXiv 2004

[51] [51]

IEEE transactions on information theory , volume=

On coverings of ellipsoids in Euclidean spaces , author=. IEEE transactions on information theory , volume=. 2004 , publisher=

work page 2004

[52] [52]

The American mathematical monthly , volume=

Cauchy's interlace theorem for eigenvalues of Hermitian matrices , author=. The American mathematical monthly , volume=. 2004 , publisher=

work page 2004

[53] [53]

Proceedings of the National Academy of Sciences , volume=

Prevalence of neural collapse during the terminal phase of deep learning training , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

work page 2020

[54] [54]

Communications of the ACM , volume=

On the implicit bias in deep-learning algorithms , author=. Communications of the ACM , volume=. 2023 , publisher=

work page 2023

[55] [55]

arXiv preprint arXiv:2010.02501 , year=

A unifying view on implicit bias in training linear neural networks , author=. arXiv preprint arXiv:2010.02501 , year=

work page arXiv 2010

[56] [56]

International Conference on Machine Learning , pages=

On the implicit bias of initialization shape: Beyond infinitesimal mirror descent , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021

[57] [57]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

, author=

Adaptive subgradient methods for online learning and stochastic optimization. , author=. Journal of machine learning research , volume=

work page

[59] [59]

Cited on , volume=

Neural networks for machine learning lecture 6a overview of mini-batch gradient descent , author=. Cited on , volume=

work page

[60] [60]

Acta Mathematica , volume=

Regularity of gaussian processes , author=. Acta Mathematica , volume=. 1987 , publisher=

work page 1987

[61] [61]

Ecole d’Et

Regularite des trajectoires des fonctions aleatoires gaussiennes , author=. Ecole d’Et. 1974 , publisher=

work page 1974

[62] [62]

Lecture Notes (Princeton University) , volume=

Probability in high dimension , author=. Lecture Notes (Princeton University) , volume=

work page

[63] [63]

Alessandro Rinaldo and Enxu Yan , title =

work page

[64] [64]

2023 IEEE International Symposium on Information Theory (ISIT) , pages=

Majorizing Measures, Codes, and Information , author=. 2023 IEEE International Symposium on Information Theory (ISIT) , pages=. 2023 , organization=

work page 2023

[65] [65]

Minkowski inequality , howpublished =

work page

[66] [66]

Normed vector space , howpublished =

work page

[67] [67]

Min–Max Theorem , author =

work page

[68] [68]

Envelope theorem , howpublished =

work page

[69] [69]

Proceedings of the IEEE , volume=

Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

work page 1998

[70] [70]

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms , author=. arXiv preprint arXiv:1708.07747 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[71] [71]

Advances in neural information processing systems , volume=

Exploring generalization in deep learning , author=. Advances in neural information processing systems , volume=

work page

[72] [72]

NIPS Workshop on Deep Learning and Unsupervised Feature Learning , year=

Reading digits in natural images with unsupervised feature learning , author=. NIPS Workshop on Deep Learning and Unsupervised Feature Learning , year=

work page

[73] [73]

Learning multiple layers of features from tiny images , author=

work page

[74] [74]

Foundations and Trends

Sketching as a tool for numerical linear algebra , author=. Foundations and Trends. 2014 , publisher=

work page 2014

[75] [75]

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data , author=. arXiv preprint arXiv:1703.11008 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[76] [76]

SIAM journal on Matrix Analysis and Applications , volume=

The geometry of algorithms with orthogonality constraints , author=. SIAM journal on Matrix Analysis and Applications , volume=. 1998 , publisher=

work page 1998

[77] [77]

Advances In Neural Information Processing Systems , volume=

Generalization error bounds for collaborative prediction with low-rank matrices , author=. Advances In Neural Information Processing Systems , volume=

work page

[78] [78]

2013 , publisher=

Matrix computations , author=. 2013 , publisher=

work page 2013

[79] [79]

2013 , publisher=

Matrix analysis , author=. 2013 , publisher=

work page 2013

[80] [80]

Convex Geometric Analysis , volume=

Metric entropy of the Grassmann manifold , author=. Convex Geometric Analysis , volume=. 1998 , publisher=

work page 1998