Learning Sparse Compositional Functions with Norm-Constrained Neural Networks

Lorenzo Fiorito; Lorenzo Rosasco; Shuo Huang; Tomaso Poggio

arxiv: 2605.25608 · v1 · pith:PSIBWK5Hnew · submitted 2026-05-25 · 📊 stat.ML · cs.LG

Learning Sparse Compositional Functions with Norm-Constrained Neural Networks

Shuo Huang , Lorenzo Fiorito , Lorenzo Rosasco , Tomaso Poggio This is my paper

Pith reviewed 2026-06-29 20:30 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords sparse compositional functionsdirected acyclic graphsFrobenius normdeep neural networksapproximation ratesexcess risk boundscurse of dimensionalityoverparameterized regimes

0 comments

The pith

Frobenius norm-constrained deep neural networks achieve approximation rates and excess risk bounds for sparse compositional functions represented by DAGs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes approximation rates and excess risk bounds for learning sparse compositional functions whose structure is captured by directed acyclic graphs, using deep neural networks whose parameters are constrained in Frobenius norm. This framework measures complexity through the norm rather than parameter count, allowing non-vacuous guarantees in overparameterized regimes where parameters exceed samples. The approach covers multi-index models, binary tree structures, and general compositional architectures because every efficiently Turing-computable function admits sparse compositional representations via DAGs. The derived rates demonstrate that networks exploit hierarchical structure to avoid the curse of dimensionality.

Core claim

We establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the CoD through hierarchical representati

What carries the argument

Frobenius norm-constrained deep neural networks applied to DAG representations of sparse compositional structure.

If this is right

Deep networks exploit the compositional structure of target functions to avoid the curse of dimensionality.
The framework applies to multi-index models, binary tree structures, and general compositional architectures.
Every efficiently Turing computable function admits sparse compositional representations via DAGs.
The norm-based complexity measure produces non-vacuous bounds when the number of parameters exceeds the sample size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Regularization that explicitly controls Frobenius norm during training may be especially effective for tasks with hidden compositional structure.
The DAG representation could be relaxed to allow approximate or noisy compositional graphs while retaining similar rates.
Similar norm-based analysis might extend to recurrent or attention-based architectures that also process hierarchical data.

Load-bearing premise

The target functions admit sparse compositional representations via DAGs and the Frobenius norm of network parameters provides an appropriate complexity measure that yields non-vacuous bounds in the overparameterized regime.

What would settle it

A concrete sparse compositional function on a DAG for which the approximation rate or excess risk bound fails to improve over unstructured high-dimensional learning when the network is constrained only by Frobenius norm.

Figures

Figures reproduced from arXiv: 2605.25608 by Lorenzo Fiorito, Lorenzo Rosasco, Shuo Huang, Tomaso Poggio.

**Figure 2.** Figure 2: Binary-tree construction of the monomial approximator, illustrated for the case [PITH_FULL_IMAGE:figures/full_fig_p021_2.png] view at source ↗

**Figure 3.** Figure 3: Partition of Unity via Localized Hat Functions. The plot illustrates the basis functions [PITH_FULL_IMAGE:figures/full_fig_p023_3.png] view at source ↗

read the original abstract

The ability of deep neural networks to learn hierarchical features is widely regarded as a key mechanism underlying their success in high-dimensional learning. Existing theory partially supports this view by establishing approximation rates based on parameter counts and sample complexity guarantees for compositional models without incurring the curse of dimensionality (CoD). To study overparameterized regimes, where the number of parameters exceeds the sample size, we develop a framework that measures complexity via the parameter norm. Within this approach, we establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the CoD through hierarchical representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives approximation and excess risk bounds for DAG-structured sparse compositional functions via Frobenius-norm constrained networks in overparameterized regimes.

read the letter

The core claim is that Frobenius norm constraints on deep networks yield approximation rates and excess risk bounds for sparse compositional functions whose structure is captured by DAGs. This targets the overparameterized case and aims to avoid the curse of dimensionality.

What is new is the explicit use of the norm as the complexity measure for these DAG models when parameters exceed samples. The work extends compositional approximation theory to this setting and notes that the framework covers efficiently Turing-computable targets, including multi-index models, binary trees, and general compositional architectures.

The paper does a reasonable job stating the motivation and the breadth of the class of functions. The hierarchical representation argument for dodging the curse of dimensionality follows standard lines in this literature and is internally consistent.

The soft spot is that the abstract supplies no derivations or proof sketches, so it is impossible to check whether the stated rates actually follow from the norm constraint alone or whether extra assumptions on smoothness or data distribution are required to make the bounds non-vacuous. If the full paper contains the technical steps, that would clarify the contribution.

This is for readers working on generalization bounds for structured high-dimensional functions in statistical learning theory. A serious referee could verify the proofs and the tightness of the rates.

I would send it to peer review.

Referee Report

0 major / 2 minor

Summary. The paper develops a norm-based complexity framework for overparameterized deep networks and derives approximation rates together with excess risk bounds for sparse compositional target functions whose structure is encoded by directed acyclic graphs (DAGs). The bounds are obtained for Frobenius-norm-constrained networks and are claimed to hold for any efficiently Turing-computable function, thereby covering multi-index models, binary-tree compositions, and general hierarchical architectures while avoiding the curse of dimensionality.

Significance. If the stated rates and bounds are valid, the work supplies a concrete theoretical account of how norm constraints can control complexity in the overparameterized regime and how DAG-structured compositional representations permit dimension-free learning. The explicit link to Turing-computable functions broadens the scope beyond the usual hand-crafted compositional examples and supplies a unified treatment of several standard model classes.

minor comments (2)

The abstract states that the rates 'show that deep networks can exploit the compositional structure,' yet the precise dependence of the constants on the DAG depth, width, and sparsity parameters is not summarized; a short table or corollary collecting the leading terms would improve readability.
Notation for the Frobenius-norm ball and the DAG-induced function class is introduced without an explicit comparison to the more common spectral-norm or path-norm constraints used in related compositional analyses; a brief remark on why the Frobenius choice yields non-vacuous bounds would clarify the contribution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear under the MAJOR COMMENTS section of the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation establishes approximation rates and excess risk bounds directly from the Frobenius norm constraint applied to networks representing DAG-structured sparse compositional functions. These bounds follow from standard norm-based complexity measures on the assumed target class without any reduction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The claim that efficiently Turing-computable functions admit such representations serves as broad motivation rather than a circular premise in the core bounds. The argument remains self-contained against external benchmarks for compositional approximation theory.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5706 in / 1105 out tokens · 32203 ms · 2026-06-29T20:30:46.346688+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Representation Costs in Data Science: Foundations and the Quasi-Banach Spaces of Deep Neural Networks
math.FA 2026-06 unverdicted novelty 7.0

Develops general framework for representation costs of parametric models, proving that depth-L ReLU networks induce p-normable quasi-Banach spaces with p=2/L.

Reference graph

Works this paper leans on

95 extracted references · 21 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks

Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. InConference on Learning Theory, pages 4782–4887. PMLR, 2022

2022
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

What can resnet learn efficiently, going beyond kernels? Advances in Neural Information Processing Systems, 32, 2019

Zeyuan Allen-Zhu and Yuanzhi Li. What can resnet learn efficiently, going beyond kernels? Advances in Neural Information Processing Systems, 32, 2019

2019
[4]

Cambridge University Press, 2009

MartinAnthonyandPeterLBartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 2009

2009
[5]

Stronger generalization bounds for deep nets via a compression approach

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. InInternational Conference on Machine Learning, pages 254–263. PMLR, 2018

2018
[6]

Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(1):629–681, 2017

Francis Bach. Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(1):629–681, 2017

2017
[7]

Local rademacher complexities.Annals of Statistics, 33(4):1497–1537, 2005

Peter Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities.Annals of Statistics, 33(4):1497–1537, 2005

2005
[8]

Rademacher and gaussian complexities: Risk bounds and structural results.Journal of Machine Learning Research, 3(Nov):463–482, 2002

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results.Journal of Machine Learning Research, 3(Nov):463–482, 2002

2002
[9]

Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

2017
[10]

Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

2020
[11]

On deep learning as a remedy for the curse of dimensionality in nonparametric regression.The Annals of Statistics, 47(4):2261–2285, 2019

Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression.The Annals of Statistics, 47(4):2261–2285, 2019

2019
[12]

What size net gives valid generalization?Advances in Neural Information Processing Systems, 1, 1988

Eric Baum and David Haussler. What size net gives valid generalization?Advances in Neural Information Processing Systems, 1, 1988

1988
[13]

Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation.Acta Numerica, 30:203–248, 2021

Mikhail Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation.Acta Numerica, 30:203–248, 2021

2021
[14]

Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

2019
[15]

Recognition-by-components: a theory of human image understanding.Psycho- logical review, 94(2):115, 1987

Irving Biederman. Recognition-by-components: a theory of human image understanding.Psycho- logical review, 94(2):115, 1987

1987
[16]

On learning gaussian multi-index models with gradient flow.arXiv preprint arXiv:2310.19793, 2023

Alberto Bietti, Joan Bruna, and Loucas Pillaud-Vivien. On learning gaussian multi-index models with gradient flow.arXiv preprint arXiv:2310.19793, 2023

work page arXiv 2023
[17]

How deep neural networks learn compositional data: The random hierarchy model.Physical Review X, 14(3):031001, 2024

Francesco Cagnetta, Leonardo Petrini, Umberto M Tomasini, Alessandro Favero, and Matthieu Wyart. How deep neural networks learn compositional data: The random hierarchy model.Physical Review X, 14(3):031001, 2024

2024
[18]

automatically

Yunlu Chen, Yang Li, Keli Liu, and Feng Ruan. Kernel learning in ridge regression "automatically" yields exact low rank solution.arXiv preprint arXiv:2310.11736, 2023

work page arXiv 2023
[19]

On lazy training in differentiable programming

Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019

2019
[20]

Three models for the description of language.IRE Transactions on information theory, 2(3):113–124, 1956

Noam Chomsky. Three models for the description of language.IRE Transactions on information theory, 2(3):113–124, 1956

1956
[21]

Compositional sparsity, approximation classes, and parametric transport equations.Constructive Approximation, 61(2):219–283, 2025

Wolfgang Dahmen. Compositional sparsity, approximation classes, and parametric transport equations.Constructive Approximation, 61(2):219–283, 2025

2025
[22]

Computational-statistical gaps in gaussian single-index models

Alex Damian, Loucas Pillaud-Vivien, Jason Lee, and Joan Bruna. Computational-statistical gaps in gaussian single-index models. InThe Thirty Seventh Annual Conference on Learning Theory, pages 1262–1262. PMLR, 2024. 11

2024
[23]

The computational advantage of depth: Learning high-dimensional hierarchical functions with gradient descent.arXiv preprint arXiv:2502.13961, 2025

Yatin Dandi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The computational advantage of depth: Learning high-dimensional hierarchical functions with gradient descent.arXiv preprint arXiv:2502.13961, 2025

work page arXiv 2025
[24]

Position: A theory of deep learning must include compositional sparsity.arXiv preprint arXiv:2507.02550, 2025

David A Danhofer, Davide D’Ascenzo, Rafael Dubach, and Tomaso Poggio. Position: A theory of deep learning must include compositional sparsity.arXiv preprint arXiv:2507.02550, 2025

work page arXiv 2025
[25]

Optimal scaling laws in learning hierarchical multi-index models.arXiv preprint arXiv:2602.05846, 2026

Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, and Antoine Maillard. Optimal scaling laws in learning hierarchical multi-index models.arXiv preprint arXiv:2602.05846, 2026

work page arXiv 2026
[26]

How does the brain solve visual object recognition?Neuron, 73(3):415–434, 2012

James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition?Neuron, 73(3):415–434, 2012

2012
[27]

High-dimensional data analysis: The curses and blessings of dimensionality

David L Donoho et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS math challenges lecture, 1(2000):32, 2000

2000
[28]

Theory of deep convolutional neural networks ii: Spherical analysis.Neural Networks, 131:154–162, 2020

Zhiying Fang, Han Feng, Shuo Huang, and Ding-Xuan Zhou. Theory of deep convolutional neural networks ii: Spherical analysis.Neural Networks, 131:154–162, 2020

2020
[29]

Distributed hierarchical processing in the primate cerebral cortex.Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991

Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex.Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991

1991
[30]

Generalization analysis of cnns for classification on spheres.IEEE Transactions on Neural Networks and Learning Systems, 34(9):6200–6213, 2023

Han Feng, Shuo Huang, and Ding-Xuan Zhou. Generalization analysis of cnns for classification on spheres.IEEE Transactions on Neural Networks and Learning Systems, 34(9):6200–6213, 2023

2023
[31]

Kernel dimension reduction in regression

Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Kernel dimension reduction in regression. The Annals of Statistics, pages 1871–1905, 2009

1905
[32]

Norm-based generalization bounds for compositionally sparse neural networks.arXiv preprint arXiv:2301.12033, 2023

Tomer Galanti, Mengjia Xu, Liane Galanti, and Tomaso Poggio. Norm-based generalization bounds for compositionally sparse neural networks.arXiv preprint arXiv:2301.12033, 2023

work page arXiv 2023
[33]

Size-independent sample complexity of neural networks.Information and Inference: A Journal of the IMA, 9(2):473–504, 2020

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks.Information and Inference: A Journal of the IMA, 9(2):473–504, 2020

2020
[34]

The human visual cortex.Annu

Kalanit Grill-Spector and Rafael Malach. The human visual cortex.Annu. Rev. Neurosci., 27(1): 649–677, 2004

2004
[35]

Implicit bias of gradient descent on linear convolutional networks.Advances in Neural Information Processing Systems, 31, 2018

Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks.Advances in Neural Information Processing Systems, 31, 2018

2018
[36]

Springer Science & Business Media, 2006

László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

2006
[37]

Depth selection for deep relu nets in feature extraction and generalization.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1853–1868, 2020

Zhi Han, Siquan Yu, Shao-Bo Lin, and Ding-Xuan Zhou. Depth selection for deep relu nets in feature extraction and generalization.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1853–1868, 2020

2020
[38]

Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

2022
[39]

Learning multi-index models with hyper-kernel ridge regression.arXiv preprint arXiv:2510.02532, 2025

Shuo Huang, Hippolyte Labarrière, Ernesto De Vito, Tomaso Poggio, and Lorenzo Rosasco. Learning multi-index models with hyper-kernel ridge regression.arXiv preprint arXiv:2510.02532, 2025

work page arXiv 2025
[40]

Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

2018
[41]

Directional convergence and alignment in deep learning.Advances in Neural Information Processing Systems, 33:17176–17186, 2020

Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning.Advances in Neural Information Processing Systems, 33:17176–17186, 2020

2020
[42]

Approximation bounds for norm constrained neural networks with applications to regression and gans.Applied and Computational Harmonic Analysis, 65:249–278, 2023

Yuling Jiao, Yang Wang, and Yunfei Yang. Approximation bounds for norm constrained neural networks with applications to regression and gans.Applied and Computational Harmonic Analysis, 65:249–278, 2023

2023
[43]

Nonparametric estimation of composite functions.Annals of Statistics, 37(3):1360–1404, 2009

Anatoli B Juditsky, Oleg Lepski, and Alexandre B Tsybakov. Nonparametric estimation of composite functions.Annals of Statistics, 37(3):1360–1404, 2009. 12

2009
[44]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[45]

Estimating multi-index models with response-conditional least squares.Electronic Journal of Statistics, 15(1):589–629, 2021

T Klock, A Lanteri, and S Vigogna. Estimating multi-index models with response-conditional least squares.Electronic Journal of Statistics, 15(1):589–629, 2021

2021
[46]

Analysis of convolutional neural network image classifiers in a rotationally symmetric model.IEEE Transactions on Information Theory, 69(8):5203–5218, 2023

Michael Kohler and Benjamin Kohler. Analysis of convolutional neural network image classifiers in a rotationally symmetric model.IEEE Transactions on Information Theory, 69(8):5203–5218, 2023

2023
[47]

On the rate of convergence of fully connected deep neural network regression estimates.The Annals of Statistics, 49(4):2231–2249, 2021

Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates.The Annals of Statistics, 49(4):2231–2249, 2021

2021
[48]

Estimation of a function of low local dimensionality by deep neural networks.IEEE Transactions on Information Theory, 68(6): 4032–4042, 2022

Michael Kohler, Adam Krzyżak, and Sophie Langer. Estimation of a function of low local dimensionality by deep neural networks.IEEE Transactions on Information Theory, 68(6): 4032–4042, 2022

2022
[49]

Imagenet classification with deep convolutional neural networks.Advances in Neural Information Processing Systems, 25, 2012

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in Neural Information Processing Systems, 25, 2012

2012
[50]

Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. InAdvances in Neural Information Processing Systems, volume 4, pages 950–957. Morgan Kaufmann, 1991

1991
[51]

Springer Science & Business Media, 1991

Michel Ledoux and Michel Talagrand.Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991

1991
[52]

Sliced inverse regression for dimension reduction.Journal of the American Statistical Association, 86(414):316–327, 1991

Ker-Chau Li. Sliced inverse regression for dimension reduction.Journal of the American Statistical Association, 86(414):316–327, 1991

1991
[53]

Deep network approximation for smooth functions.SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021

Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions.SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021

2021
[54]

Approximating functions with multi-features by deep convolutional neural networks.Analysis and Applications, 21(01):93–125, 2023

Tong Mao, Zhongjie Shi, and Ding-Xuan Zhou. Approximating functions with multi-features by deep convolutional neural networks.Analysis and Applications, 21(01):93–125, 2023

2023
[55]

Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. InConference on learning theory, pages 2388–2464. PMLR, 2019

2019
[56]

When and why are deep networks better than shallow ones? InProceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. When and why are deep networks better than shallow ones? InProceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

2017
[57]

MIT press, 2018

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of machine learning. MIT press, 2018

2018
[58]

New error bounds for deep relu networks using sparse grids

Hadrien Montanelli and Qiang Du. New error bounds for deep relu networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1):78–92, 2019

2019
[59]

Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension.arXiv preprint arXiv:2602.03539, 2026

Thomas Nagler and Sophie Langer. Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension.arXiv preprint arXiv:2602.03539, 2026

work page arXiv 2026
[60]

In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning.arXiv preprint arXiv:1412.6614, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[61]

Norm-based capacity control in neural networks

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. InConference on learning theory, pages 1376–1401. PMLR, 2015

2015
[62]

Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

2022
[63]

Approximation theory of the mlp model in neural networks.Acta numerica, 8: 143–195, 1999

Allan Pinkus. Approximation theory of the mlp model in neural networks.Acta numerica, 8: 143–195, 1999. 13

1999
[64]

On efficiently computable functions, deep networks and sparse compositionality

Tomaso Poggio. On efficiently computable functions, deep networks and sparse compositionality. arXiv preprint arXiv:2510.11942, 2025

work page arXiv 2025
[65]

Compositional sparsity of learnable functions.Bulletin of the American Mathematical Society, 61(3):438–456, 2024

Tomaso Poggio and Maia Fraser. Compositional sparsity of learnable functions.Bulletin of the American Mathematical Society, 61(3):438–456, 2024

2024
[66]

Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.International Journal of Automation and Computing, 14(5):503–519, 2017

Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.International Journal of Automation and Computing, 14(5):503–519, 2017

2017
[67]

Mecha- nism for feature learning in neural networks and backpropagation-free machine learning models

Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, and Mikhail Belkin. Mecha- nism for feature learning in neural networks and backpropagation-free machine learning models. Science, 383(6690):1461–1467, 2024

2024
[68]

Neural Networks With Dense Weights Are Not Universal Approximators

Levi Rauchwerger, Stefanie Jegelka, and Ron Levie. Dense neural networks are not universal approximators.arXiv preprint arXiv:2602.07618, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[69]

Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining

Yunwei Ren, Yatin Dandi, Florent Krzakala, and Jason D Lee. Provable learning of random hierarchy models and hierarchical shallow-to-deep chaining.arXiv preprint arXiv:2601.19756, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[70]

Nonparametricregressionusingdeepneuralnetworkswithreluactivation function.The Annals of Statistics, 48(4):1875–1897, 2020

JohannesSchmidt-Hieber. Nonparametricregressionusingdeepneuralnetworkswithreluactivation function.The Annals of Statistics, 48(4):1875–1897, 2020

2020
[71]

A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1): e2408799121, 2025

Antonio Sclocchi, Alessandro Favero, and Matthieu Wyart. A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1): e2408799121, 2025

2025
[72]

Deep network approximation characterized by number of neurons.arXiv preprint arXiv:1906.05497, 2019

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons.arXiv preprint arXiv:1906.05497, 2019

work page arXiv 1906
[73]

Approximation and estimation capability of vision transformers for hierarchical compositional models.Applied and Computational Harmonic Analysis, page 101849, 2025

Zhongjie Shi, Zhiying Fang, and Yuan Cao. Approximation and estimation capability of vision transformers for hierarchical compositional models.Applied and Computational Harmonic Analysis, page 101849, 2025

2025
[74]

Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks.Foundations of Computational Mathematics, 24(2):481–537, 2024

Jonathan W Siegel and Jinchao Xu. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks.Foundations of Computational Mathematics, 24(2):481–537, 2024

2024
[75]

The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70): 1–57, 2018

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70): 1–57, 2018

2018
[76]

Springer Science & Business Media, 2008

Ingo Steinwart and Andreas Christmann.Support vector machines. Springer Science & Business Media, 2008

2008
[77]

Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality

Taiji Suzuki. Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality. InInternational Conference on Learning Representations, 2019

2019
[78]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[79]

Deep learning and the information bottleneck principle

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. Ieee, 2015

2015
[80]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

Showing first 80 references.

[1] [1]

The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks

Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. InConference on Learning Theory, pages 4782–4887. PMLR, 2022

2022

[2] [2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

What can resnet learn efficiently, going beyond kernels? Advances in Neural Information Processing Systems, 32, 2019

Zeyuan Allen-Zhu and Yuanzhi Li. What can resnet learn efficiently, going beyond kernels? Advances in Neural Information Processing Systems, 32, 2019

2019

[4] [4]

Cambridge University Press, 2009

MartinAnthonyandPeterLBartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 2009

2009

[5] [5]

Stronger generalization bounds for deep nets via a compression approach

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. InInternational Conference on Machine Learning, pages 254–263. PMLR, 2018

2018

[6] [6]

Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(1):629–681, 2017

Francis Bach. Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(1):629–681, 2017

2017

[7] [7]

Local rademacher complexities.Annals of Statistics, 33(4):1497–1537, 2005

Peter Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities.Annals of Statistics, 33(4):1497–1537, 2005

2005

[8] [8]

Rademacher and gaussian complexities: Risk bounds and structural results.Journal of Machine Learning Research, 3(Nov):463–482, 2002

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results.Journal of Machine Learning Research, 3(Nov):463–482, 2002

2002

[9] [9]

Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

2017

[10] [10]

Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

2020

[11] [11]

On deep learning as a remedy for the curse of dimensionality in nonparametric regression.The Annals of Statistics, 47(4):2261–2285, 2019

Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression.The Annals of Statistics, 47(4):2261–2285, 2019

2019

[12] [12]

What size net gives valid generalization?Advances in Neural Information Processing Systems, 1, 1988

Eric Baum and David Haussler. What size net gives valid generalization?Advances in Neural Information Processing Systems, 1, 1988

1988

[13] [13]

Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation.Acta Numerica, 30:203–248, 2021

Mikhail Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation.Acta Numerica, 30:203–248, 2021

2021

[14] [14]

Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

2019

[15] [15]

Recognition-by-components: a theory of human image understanding.Psycho- logical review, 94(2):115, 1987

Irving Biederman. Recognition-by-components: a theory of human image understanding.Psycho- logical review, 94(2):115, 1987

1987

[16] [16]

On learning gaussian multi-index models with gradient flow.arXiv preprint arXiv:2310.19793, 2023

Alberto Bietti, Joan Bruna, and Loucas Pillaud-Vivien. On learning gaussian multi-index models with gradient flow.arXiv preprint arXiv:2310.19793, 2023

work page arXiv 2023

[17] [17]

How deep neural networks learn compositional data: The random hierarchy model.Physical Review X, 14(3):031001, 2024

Francesco Cagnetta, Leonardo Petrini, Umberto M Tomasini, Alessandro Favero, and Matthieu Wyart. How deep neural networks learn compositional data: The random hierarchy model.Physical Review X, 14(3):031001, 2024

2024

[18] [18]

automatically

Yunlu Chen, Yang Li, Keli Liu, and Feng Ruan. Kernel learning in ridge regression "automatically" yields exact low rank solution.arXiv preprint arXiv:2310.11736, 2023

work page arXiv 2023

[19] [19]

On lazy training in differentiable programming

Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019

2019

[20] [20]

Three models for the description of language.IRE Transactions on information theory, 2(3):113–124, 1956

Noam Chomsky. Three models for the description of language.IRE Transactions on information theory, 2(3):113–124, 1956

1956

[21] [21]

Compositional sparsity, approximation classes, and parametric transport equations.Constructive Approximation, 61(2):219–283, 2025

Wolfgang Dahmen. Compositional sparsity, approximation classes, and parametric transport equations.Constructive Approximation, 61(2):219–283, 2025

2025

[22] [22]

Computational-statistical gaps in gaussian single-index models

Alex Damian, Loucas Pillaud-Vivien, Jason Lee, and Joan Bruna. Computational-statistical gaps in gaussian single-index models. InThe Thirty Seventh Annual Conference on Learning Theory, pages 1262–1262. PMLR, 2024. 11

2024

[23] [23]

The computational advantage of depth: Learning high-dimensional hierarchical functions with gradient descent.arXiv preprint arXiv:2502.13961, 2025

Yatin Dandi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The computational advantage of depth: Learning high-dimensional hierarchical functions with gradient descent.arXiv preprint arXiv:2502.13961, 2025

work page arXiv 2025

[24] [24]

Position: A theory of deep learning must include compositional sparsity.arXiv preprint arXiv:2507.02550, 2025

David A Danhofer, Davide D’Ascenzo, Rafael Dubach, and Tomaso Poggio. Position: A theory of deep learning must include compositional sparsity.arXiv preprint arXiv:2507.02550, 2025

work page arXiv 2025

[25] [25]

Optimal scaling laws in learning hierarchical multi-index models.arXiv preprint arXiv:2602.05846, 2026

Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, and Antoine Maillard. Optimal scaling laws in learning hierarchical multi-index models.arXiv preprint arXiv:2602.05846, 2026

work page arXiv 2026

[26] [26]

How does the brain solve visual object recognition?Neuron, 73(3):415–434, 2012

James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition?Neuron, 73(3):415–434, 2012

2012

[27] [27]

High-dimensional data analysis: The curses and blessings of dimensionality

David L Donoho et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS math challenges lecture, 1(2000):32, 2000

2000

[28] [28]

Theory of deep convolutional neural networks ii: Spherical analysis.Neural Networks, 131:154–162, 2020

Zhiying Fang, Han Feng, Shuo Huang, and Ding-Xuan Zhou. Theory of deep convolutional neural networks ii: Spherical analysis.Neural Networks, 131:154–162, 2020

2020

[29] [29]

Distributed hierarchical processing in the primate cerebral cortex.Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991

Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex.Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991

1991

[30] [30]

Generalization analysis of cnns for classification on spheres.IEEE Transactions on Neural Networks and Learning Systems, 34(9):6200–6213, 2023

Han Feng, Shuo Huang, and Ding-Xuan Zhou. Generalization analysis of cnns for classification on spheres.IEEE Transactions on Neural Networks and Learning Systems, 34(9):6200–6213, 2023

2023

[31] [31]

Kernel dimension reduction in regression

Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Kernel dimension reduction in regression. The Annals of Statistics, pages 1871–1905, 2009

1905

[32] [32]

Norm-based generalization bounds for compositionally sparse neural networks.arXiv preprint arXiv:2301.12033, 2023

Tomer Galanti, Mengjia Xu, Liane Galanti, and Tomaso Poggio. Norm-based generalization bounds for compositionally sparse neural networks.arXiv preprint arXiv:2301.12033, 2023

work page arXiv 2023

[33] [33]

Size-independent sample complexity of neural networks.Information and Inference: A Journal of the IMA, 9(2):473–504, 2020

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks.Information and Inference: A Journal of the IMA, 9(2):473–504, 2020

2020

[34] [34]

The human visual cortex.Annu

Kalanit Grill-Spector and Rafael Malach. The human visual cortex.Annu. Rev. Neurosci., 27(1): 649–677, 2004

2004

[35] [35]

Implicit bias of gradient descent on linear convolutional networks.Advances in Neural Information Processing Systems, 31, 2018

Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks.Advances in Neural Information Processing Systems, 31, 2018

2018

[36] [36]

Springer Science & Business Media, 2006

László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

2006

[37] [37]

Depth selection for deep relu nets in feature extraction and generalization.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1853–1868, 2020

Zhi Han, Siquan Yu, Shao-Bo Lin, and Ding-Xuan Zhou. Depth selection for deep relu nets in feature extraction and generalization.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1853–1868, 2020

2020

[38] [38]

Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

2022

[39] [39]

Learning multi-index models with hyper-kernel ridge regression.arXiv preprint arXiv:2510.02532, 2025

Shuo Huang, Hippolyte Labarrière, Ernesto De Vito, Tomaso Poggio, and Lorenzo Rosasco. Learning multi-index models with hyper-kernel ridge regression.arXiv preprint arXiv:2510.02532, 2025

work page arXiv 2025

[40] [40]

Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

2018

[41] [41]

Directional convergence and alignment in deep learning.Advances in Neural Information Processing Systems, 33:17176–17186, 2020

Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning.Advances in Neural Information Processing Systems, 33:17176–17186, 2020

2020

[42] [42]

Approximation bounds for norm constrained neural networks with applications to regression and gans.Applied and Computational Harmonic Analysis, 65:249–278, 2023

Yuling Jiao, Yang Wang, and Yunfei Yang. Approximation bounds for norm constrained neural networks with applications to regression and gans.Applied and Computational Harmonic Analysis, 65:249–278, 2023

2023

[43] [43]

Nonparametric estimation of composite functions.Annals of Statistics, 37(3):1360–1404, 2009

Anatoli B Juditsky, Oleg Lepski, and Alexandre B Tsybakov. Nonparametric estimation of composite functions.Annals of Statistics, 37(3):1360–1404, 2009. 12

2009

[44] [44]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[45] [45]

Estimating multi-index models with response-conditional least squares.Electronic Journal of Statistics, 15(1):589–629, 2021

T Klock, A Lanteri, and S Vigogna. Estimating multi-index models with response-conditional least squares.Electronic Journal of Statistics, 15(1):589–629, 2021

2021

[46] [46]

Analysis of convolutional neural network image classifiers in a rotationally symmetric model.IEEE Transactions on Information Theory, 69(8):5203–5218, 2023

Michael Kohler and Benjamin Kohler. Analysis of convolutional neural network image classifiers in a rotationally symmetric model.IEEE Transactions on Information Theory, 69(8):5203–5218, 2023

2023

[47] [47]

On the rate of convergence of fully connected deep neural network regression estimates.The Annals of Statistics, 49(4):2231–2249, 2021

Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates.The Annals of Statistics, 49(4):2231–2249, 2021

2021

[48] [48]

Estimation of a function of low local dimensionality by deep neural networks.IEEE Transactions on Information Theory, 68(6): 4032–4042, 2022

Michael Kohler, Adam Krzyżak, and Sophie Langer. Estimation of a function of low local dimensionality by deep neural networks.IEEE Transactions on Information Theory, 68(6): 4032–4042, 2022

2022

[49] [49]

Imagenet classification with deep convolutional neural networks.Advances in Neural Information Processing Systems, 25, 2012

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in Neural Information Processing Systems, 25, 2012

2012

[50] [50]

Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. InAdvances in Neural Information Processing Systems, volume 4, pages 950–957. Morgan Kaufmann, 1991

1991

[51] [51]

Springer Science & Business Media, 1991

Michel Ledoux and Michel Talagrand.Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991

1991

[52] [52]

Sliced inverse regression for dimension reduction.Journal of the American Statistical Association, 86(414):316–327, 1991

Ker-Chau Li. Sliced inverse regression for dimension reduction.Journal of the American Statistical Association, 86(414):316–327, 1991

1991

[53] [53]

Deep network approximation for smooth functions.SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021

Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions.SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021

2021

[54] [54]

Approximating functions with multi-features by deep convolutional neural networks.Analysis and Applications, 21(01):93–125, 2023

Tong Mao, Zhongjie Shi, and Ding-Xuan Zhou. Approximating functions with multi-features by deep convolutional neural networks.Analysis and Applications, 21(01):93–125, 2023

2023

[55] [55]

Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. InConference on learning theory, pages 2388–2464. PMLR, 2019

2019

[56] [56]

When and why are deep networks better than shallow ones? InProceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. When and why are deep networks better than shallow ones? InProceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

2017

[57] [57]

MIT press, 2018

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of machine learning. MIT press, 2018

2018

[58] [58]

New error bounds for deep relu networks using sparse grids

Hadrien Montanelli and Qiang Du. New error bounds for deep relu networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1):78–92, 2019

2019

[59] [59]

Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension.arXiv preprint arXiv:2602.03539, 2026

Thomas Nagler and Sophie Langer. Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension.arXiv preprint arXiv:2602.03539, 2026

work page arXiv 2026

[60] [60]

In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning.arXiv preprint arXiv:1412.6614, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[61] [61]

Norm-based capacity control in neural networks

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. InConference on learning theory, pages 1376–1401. PMLR, 2015

2015

[62] [62]

Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

2022

[63] [63]

Approximation theory of the mlp model in neural networks.Acta numerica, 8: 143–195, 1999

Allan Pinkus. Approximation theory of the mlp model in neural networks.Acta numerica, 8: 143–195, 1999. 13

1999

[64] [64]

On efficiently computable functions, deep networks and sparse compositionality

Tomaso Poggio. On efficiently computable functions, deep networks and sparse compositionality. arXiv preprint arXiv:2510.11942, 2025

work page arXiv 2025

[65] [65]

Compositional sparsity of learnable functions.Bulletin of the American Mathematical Society, 61(3):438–456, 2024

Tomaso Poggio and Maia Fraser. Compositional sparsity of learnable functions.Bulletin of the American Mathematical Society, 61(3):438–456, 2024

2024

[66] [66]

Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.International Journal of Automation and Computing, 14(5):503–519, 2017

Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.International Journal of Automation and Computing, 14(5):503–519, 2017

2017

[67] [67]

Mecha- nism for feature learning in neural networks and backpropagation-free machine learning models

Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, and Mikhail Belkin. Mecha- nism for feature learning in neural networks and backpropagation-free machine learning models. Science, 383(6690):1461–1467, 2024

2024

[68] [68]

Neural Networks With Dense Weights Are Not Universal Approximators

Levi Rauchwerger, Stefanie Jegelka, and Ron Levie. Dense neural networks are not universal approximators.arXiv preprint arXiv:2602.07618, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[69] [69]

Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining

Yunwei Ren, Yatin Dandi, Florent Krzakala, and Jason D Lee. Provable learning of random hierarchy models and hierarchical shallow-to-deep chaining.arXiv preprint arXiv:2601.19756, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[70] [70]

Nonparametricregressionusingdeepneuralnetworkswithreluactivation function.The Annals of Statistics, 48(4):1875–1897, 2020

JohannesSchmidt-Hieber. Nonparametricregressionusingdeepneuralnetworkswithreluactivation function.The Annals of Statistics, 48(4):1875–1897, 2020

2020

[71] [71]

A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1): e2408799121, 2025

Antonio Sclocchi, Alessandro Favero, and Matthieu Wyart. A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1): e2408799121, 2025

2025

[72] [72]

Deep network approximation characterized by number of neurons.arXiv preprint arXiv:1906.05497, 2019

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons.arXiv preprint arXiv:1906.05497, 2019

work page arXiv 1906

[73] [73]

Approximation and estimation capability of vision transformers for hierarchical compositional models.Applied and Computational Harmonic Analysis, page 101849, 2025

Zhongjie Shi, Zhiying Fang, and Yuan Cao. Approximation and estimation capability of vision transformers for hierarchical compositional models.Applied and Computational Harmonic Analysis, page 101849, 2025

2025

[74] [74]

Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks.Foundations of Computational Mathematics, 24(2):481–537, 2024

Jonathan W Siegel and Jinchao Xu. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks.Foundations of Computational Mathematics, 24(2):481–537, 2024

2024

[75] [75]

The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70): 1–57, 2018

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70): 1–57, 2018

2018

[76] [76]

Springer Science & Business Media, 2008

Ingo Steinwart and Andreas Christmann.Support vector machines. Springer Science & Business Media, 2008

2008

[77] [77]

Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality

Taiji Suzuki. Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality. InInternational Conference on Learning Representations, 2019

2019

[78] [78]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[79] [79]

Deep learning and the information bottleneck principle

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. Ieee, 2015

2015

[80] [80]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023