pith. sign in

arxiv: 2605.25608 · v1 · pith:PSIBWK5Hnew · submitted 2026-05-25 · 📊 stat.ML · cs.LG

Learning Sparse Compositional Functions with Norm-Constrained Neural Networks

Pith reviewed 2026-06-29 20:30 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords sparse compositional functionsdirected acyclic graphsFrobenius normdeep neural networksapproximation ratesexcess risk boundscurse of dimensionalityoverparameterized regimes
0
0 comments X

The pith

Frobenius norm-constrained deep neural networks achieve approximation rates and excess risk bounds for sparse compositional functions represented by DAGs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes approximation rates and excess risk bounds for learning sparse compositional functions whose structure is captured by directed acyclic graphs, using deep neural networks whose parameters are constrained in Frobenius norm. This framework measures complexity through the norm rather than parameter count, allowing non-vacuous guarantees in overparameterized regimes where parameters exceed samples. The approach covers multi-index models, binary tree structures, and general compositional architectures because every efficiently Turing-computable function admits sparse compositional representations via DAGs. The derived rates demonstrate that networks exploit hierarchical structure to avoid the curse of dimensionality.

Core claim

We establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the CoD through hierarchical representati

What carries the argument

Frobenius norm-constrained deep neural networks applied to DAG representations of sparse compositional structure.

If this is right

  • Deep networks exploit the compositional structure of target functions to avoid the curse of dimensionality.
  • The framework applies to multi-index models, binary tree structures, and general compositional architectures.
  • Every efficiently Turing computable function admits sparse compositional representations via DAGs.
  • The norm-based complexity measure produces non-vacuous bounds when the number of parameters exceeds the sample size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularization that explicitly controls Frobenius norm during training may be especially effective for tasks with hidden compositional structure.
  • The DAG representation could be relaxed to allow approximate or noisy compositional graphs while retaining similar rates.
  • Similar norm-based analysis might extend to recurrent or attention-based architectures that also process hierarchical data.

Load-bearing premise

The target functions admit sparse compositional representations via DAGs and the Frobenius norm of network parameters provides an appropriate complexity measure that yields non-vacuous bounds in the overparameterized regime.

What would settle it

A concrete sparse compositional function on a DAG for which the approximation rate or excess risk bound fails to improve over unstructured high-dimensional learning when the network is constrained only by Frobenius norm.

Figures

Figures reproduced from arXiv: 2605.25608 by Lorenzo Fiorito, Lorenzo Rosasco, Shuo Huang, Tomaso Poggio.

Figure 1
Figure 1. Figure 1: Illustration of a compositional function defined on a DAG (left) and its realization via [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Binary-tree construction of the monomial approximator, illustrated for the case [PITH_FULL_IMAGE:figures/full_fig_p021_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Partition of Unity via Localized Hat Functions. The plot illustrates the basis functions [PITH_FULL_IMAGE:figures/full_fig_p023_3.png] view at source ↗
read the original abstract

The ability of deep neural networks to learn hierarchical features is widely regarded as a key mechanism underlying their success in high-dimensional learning. Existing theory partially supports this view by establishing approximation rates based on parameter counts and sample complexity guarantees for compositional models without incurring the curse of dimensionality (CoD). To study overparameterized regimes, where the number of parameters exceeds the sample size, we develop a framework that measures complexity via the parameter norm. Within this approach, we establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the CoD through hierarchical representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper develops a norm-based complexity framework for overparameterized deep networks and derives approximation rates together with excess risk bounds for sparse compositional target functions whose structure is encoded by directed acyclic graphs (DAGs). The bounds are obtained for Frobenius-norm-constrained networks and are claimed to hold for any efficiently Turing-computable function, thereby covering multi-index models, binary-tree compositions, and general hierarchical architectures while avoiding the curse of dimensionality.

Significance. If the stated rates and bounds are valid, the work supplies a concrete theoretical account of how norm constraints can control complexity in the overparameterized regime and how DAG-structured compositional representations permit dimension-free learning. The explicit link to Turing-computable functions broadens the scope beyond the usual hand-crafted compositional examples and supplies a unified treatment of several standard model classes.

minor comments (2)
  1. The abstract states that the rates 'show that deep networks can exploit the compositional structure,' yet the precise dependence of the constants on the DAG depth, width, and sparsity parameters is not summarized; a short table or corollary collecting the leading terms would improve readability.
  2. Notation for the Frobenius-norm ball and the DAG-induced function class is introduced without an explicit comparison to the more common spectral-norm or path-norm constraints used in related compositional analyses; a brief remark on why the Frobenius choice yields non-vacuous bounds would clarify the contribution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear under the MAJOR COMMENTS section of the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation establishes approximation rates and excess risk bounds directly from the Frobenius norm constraint applied to networks representing DAG-structured sparse compositional functions. These bounds follow from standard norm-based complexity measures on the assumed target class without any reduction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The claim that efficiently Turing-computable functions admit such representations serves as broad motivation rather than a circular premise in the core bounds. The argument remains self-contained against external benchmarks for compositional approximation theory.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5706 in / 1105 out tokens · 32203 ms · 2026-06-29T20:30:46.346688+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Representation Costs in Data Science: Foundations and the Quasi-Banach Spaces of Deep Neural Networks

    math.FA 2026-06 unverdicted novelty 7.0

    Develops general framework for representation costs of parametric models, proving that depth-L ReLU networks induce p-normable quasi-Banach spaces with p=2/L.

Reference graph

Works this paper leans on

95 extracted references · 21 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks

    Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. InConference on Learning Theory, pages 4782–4887. PMLR, 2022

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 10

  3. [3]

    What can resnet learn efficiently, going beyond kernels? Advances in Neural Information Processing Systems, 32, 2019

    Zeyuan Allen-Zhu and Yuanzhi Li. What can resnet learn efficiently, going beyond kernels? Advances in Neural Information Processing Systems, 32, 2019

  4. [4]

    Cambridge University Press, 2009

    MartinAnthonyandPeterLBartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 2009

  5. [5]

    Stronger generalization bounds for deep nets via a compression approach

    Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. InInternational Conference on Machine Learning, pages 254–263. PMLR, 2018

  6. [6]

    Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(1):629–681, 2017

    Francis Bach. Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(1):629–681, 2017

  7. [7]

    Local rademacher complexities.Annals of Statistics, 33(4):1497–1537, 2005

    Peter Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities.Annals of Statistics, 33(4):1497–1537, 2005

  8. [8]

    Rademacher and gaussian complexities: Risk bounds and structural results.Journal of Machine Learning Research, 3(Nov):463–482, 2002

    Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results.Journal of Machine Learning Research, 3(Nov):463–482, 2002

  9. [9]

    Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

  10. [10]

    Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

    Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

  11. [11]

    On deep learning as a remedy for the curse of dimensionality in nonparametric regression.The Annals of Statistics, 47(4):2261–2285, 2019

    Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression.The Annals of Statistics, 47(4):2261–2285, 2019

  12. [12]

    What size net gives valid generalization?Advances in Neural Information Processing Systems, 1, 1988

    Eric Baum and David Haussler. What size net gives valid generalization?Advances in Neural Information Processing Systems, 1, 1988

  13. [13]

    Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation.Acta Numerica, 30:203–248, 2021

    Mikhail Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation.Acta Numerica, 30:203–248, 2021

  14. [14]

    Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

    Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

  15. [15]

    Recognition-by-components: a theory of human image understanding.Psycho- logical review, 94(2):115, 1987

    Irving Biederman. Recognition-by-components: a theory of human image understanding.Psycho- logical review, 94(2):115, 1987

  16. [16]

    On learning gaussian multi-index models with gradient flow.arXiv preprint arXiv:2310.19793, 2023

    Alberto Bietti, Joan Bruna, and Loucas Pillaud-Vivien. On learning gaussian multi-index models with gradient flow.arXiv preprint arXiv:2310.19793, 2023

  17. [17]

    How deep neural networks learn compositional data: The random hierarchy model.Physical Review X, 14(3):031001, 2024

    Francesco Cagnetta, Leonardo Petrini, Umberto M Tomasini, Alessandro Favero, and Matthieu Wyart. How deep neural networks learn compositional data: The random hierarchy model.Physical Review X, 14(3):031001, 2024

  18. [18]

    automatically

    Yunlu Chen, Yang Li, Keli Liu, and Feng Ruan. Kernel learning in ridge regression "automatically" yields exact low rank solution.arXiv preprint arXiv:2310.11736, 2023

  19. [19]

    On lazy training in differentiable programming

    Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019

  20. [20]

    Three models for the description of language.IRE Transactions on information theory, 2(3):113–124, 1956

    Noam Chomsky. Three models for the description of language.IRE Transactions on information theory, 2(3):113–124, 1956

  21. [21]

    Compositional sparsity, approximation classes, and parametric transport equations.Constructive Approximation, 61(2):219–283, 2025

    Wolfgang Dahmen. Compositional sparsity, approximation classes, and parametric transport equations.Constructive Approximation, 61(2):219–283, 2025

  22. [22]

    Computational-statistical gaps in gaussian single-index models

    Alex Damian, Loucas Pillaud-Vivien, Jason Lee, and Joan Bruna. Computational-statistical gaps in gaussian single-index models. InThe Thirty Seventh Annual Conference on Learning Theory, pages 1262–1262. PMLR, 2024. 11

  23. [23]

    The computational advantage of depth: Learning high-dimensional hierarchical functions with gradient descent.arXiv preprint arXiv:2502.13961, 2025

    Yatin Dandi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The computational advantage of depth: Learning high-dimensional hierarchical functions with gradient descent.arXiv preprint arXiv:2502.13961, 2025

  24. [24]

    Position: A theory of deep learning must include compositional sparsity.arXiv preprint arXiv:2507.02550, 2025

    David A Danhofer, Davide D’Ascenzo, Rafael Dubach, and Tomaso Poggio. Position: A theory of deep learning must include compositional sparsity.arXiv preprint arXiv:2507.02550, 2025

  25. [25]

    Optimal scaling laws in learning hierarchical multi-index models.arXiv preprint arXiv:2602.05846, 2026

    Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, and Antoine Maillard. Optimal scaling laws in learning hierarchical multi-index models.arXiv preprint arXiv:2602.05846, 2026

  26. [26]

    How does the brain solve visual object recognition?Neuron, 73(3):415–434, 2012

    James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition?Neuron, 73(3):415–434, 2012

  27. [27]

    High-dimensional data analysis: The curses and blessings of dimensionality

    David L Donoho et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS math challenges lecture, 1(2000):32, 2000

  28. [28]

    Theory of deep convolutional neural networks ii: Spherical analysis.Neural Networks, 131:154–162, 2020

    Zhiying Fang, Han Feng, Shuo Huang, and Ding-Xuan Zhou. Theory of deep convolutional neural networks ii: Spherical analysis.Neural Networks, 131:154–162, 2020

  29. [29]

    Distributed hierarchical processing in the primate cerebral cortex.Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991

    Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex.Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991

  30. [30]

    Generalization analysis of cnns for classification on spheres.IEEE Transactions on Neural Networks and Learning Systems, 34(9):6200–6213, 2023

    Han Feng, Shuo Huang, and Ding-Xuan Zhou. Generalization analysis of cnns for classification on spheres.IEEE Transactions on Neural Networks and Learning Systems, 34(9):6200–6213, 2023

  31. [31]

    Kernel dimension reduction in regression

    Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Kernel dimension reduction in regression. The Annals of Statistics, pages 1871–1905, 2009

  32. [32]

    Norm-based generalization bounds for compositionally sparse neural networks.arXiv preprint arXiv:2301.12033, 2023

    Tomer Galanti, Mengjia Xu, Liane Galanti, and Tomaso Poggio. Norm-based generalization bounds for compositionally sparse neural networks.arXiv preprint arXiv:2301.12033, 2023

  33. [33]

    Size-independent sample complexity of neural networks.Information and Inference: A Journal of the IMA, 9(2):473–504, 2020

    Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks.Information and Inference: A Journal of the IMA, 9(2):473–504, 2020

  34. [34]

    The human visual cortex.Annu

    Kalanit Grill-Spector and Rafael Malach. The human visual cortex.Annu. Rev. Neurosci., 27(1): 649–677, 2004

  35. [35]

    Implicit bias of gradient descent on linear convolutional networks.Advances in Neural Information Processing Systems, 31, 2018

    Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks.Advances in Neural Information Processing Systems, 31, 2018

  36. [36]

    Springer Science & Business Media, 2006

    László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

  37. [37]

    Depth selection for deep relu nets in feature extraction and generalization.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1853–1868, 2020

    Zhi Han, Siquan Yu, Shao-Bo Lin, and Ding-Xuan Zhou. Depth selection for deep relu nets in feature extraction and generalization.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1853–1868, 2020

  38. [38]

    Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

    Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

  39. [39]

    Learning multi-index models with hyper-kernel ridge regression.arXiv preprint arXiv:2510.02532, 2025

    Shuo Huang, Hippolyte Labarrière, Ernesto De Vito, Tomaso Poggio, and Lorenzo Rosasco. Learning multi-index models with hyper-kernel ridge regression.arXiv preprint arXiv:2510.02532, 2025

  40. [40]

    Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

  41. [41]

    Directional convergence and alignment in deep learning.Advances in Neural Information Processing Systems, 33:17176–17186, 2020

    Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning.Advances in Neural Information Processing Systems, 33:17176–17186, 2020

  42. [42]

    Approximation bounds for norm constrained neural networks with applications to regression and gans.Applied and Computational Harmonic Analysis, 65:249–278, 2023

    Yuling Jiao, Yang Wang, and Yunfei Yang. Approximation bounds for norm constrained neural networks with applications to regression and gans.Applied and Computational Harmonic Analysis, 65:249–278, 2023

  43. [43]

    Nonparametric estimation of composite functions.Annals of Statistics, 37(3):1360–1404, 2009

    Anatoli B Juditsky, Oleg Lepski, and Alexandre B Tsybakov. Nonparametric estimation of composite functions.Annals of Statistics, 37(3):1360–1404, 2009. 12

  44. [44]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016

  45. [45]

    Estimating multi-index models with response-conditional least squares.Electronic Journal of Statistics, 15(1):589–629, 2021

    T Klock, A Lanteri, and S Vigogna. Estimating multi-index models with response-conditional least squares.Electronic Journal of Statistics, 15(1):589–629, 2021

  46. [46]

    Analysis of convolutional neural network image classifiers in a rotationally symmetric model.IEEE Transactions on Information Theory, 69(8):5203–5218, 2023

    Michael Kohler and Benjamin Kohler. Analysis of convolutional neural network image classifiers in a rotationally symmetric model.IEEE Transactions on Information Theory, 69(8):5203–5218, 2023

  47. [47]

    On the rate of convergence of fully connected deep neural network regression estimates.The Annals of Statistics, 49(4):2231–2249, 2021

    Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates.The Annals of Statistics, 49(4):2231–2249, 2021

  48. [48]

    Estimation of a function of low local dimensionality by deep neural networks.IEEE Transactions on Information Theory, 68(6): 4032–4042, 2022

    Michael Kohler, Adam Krzyżak, and Sophie Langer. Estimation of a function of low local dimensionality by deep neural networks.IEEE Transactions on Information Theory, 68(6): 4032–4042, 2022

  49. [49]

    Imagenet classification with deep convolutional neural networks.Advances in Neural Information Processing Systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in Neural Information Processing Systems, 25, 2012

  50. [50]

    Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. InAdvances in Neural Information Processing Systems, volume 4, pages 950–957. Morgan Kaufmann, 1991

  51. [51]

    Springer Science & Business Media, 1991

    Michel Ledoux and Michel Talagrand.Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991

  52. [52]

    Sliced inverse regression for dimension reduction.Journal of the American Statistical Association, 86(414):316–327, 1991

    Ker-Chau Li. Sliced inverse regression for dimension reduction.Journal of the American Statistical Association, 86(414):316–327, 1991

  53. [53]

    Deep network approximation for smooth functions.SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021

    Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions.SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021

  54. [54]

    Approximating functions with multi-features by deep convolutional neural networks.Analysis and Applications, 21(01):93–125, 2023

    Tong Mao, Zhongjie Shi, and Ding-Xuan Zhou. Approximating functions with multi-features by deep convolutional neural networks.Analysis and Applications, 21(01):93–125, 2023

  55. [55]

    Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

    Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. InConference on learning theory, pages 2388–2464. PMLR, 2019

  56. [56]

    When and why are deep networks better than shallow ones? InProceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

    Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. When and why are deep networks better than shallow ones? InProceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

  57. [57]

    MIT press, 2018

    Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of machine learning. MIT press, 2018

  58. [58]

    New error bounds for deep relu networks using sparse grids

    Hadrien Montanelli and Qiang Du. New error bounds for deep relu networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1):78–92, 2019

  59. [59]

    Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension.arXiv preprint arXiv:2602.03539, 2026

    Thomas Nagler and Sophie Langer. Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension.arXiv preprint arXiv:2602.03539, 2026

  60. [60]

    In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

    Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning.arXiv preprint arXiv:1412.6614, 2014

  61. [61]

    Norm-based capacity control in neural networks

    Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. InConference on learning theory, pages 1376–1401. PMLR, 2015

  62. [62]

    Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

    Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

  63. [63]

    Approximation theory of the mlp model in neural networks.Acta numerica, 8: 143–195, 1999

    Allan Pinkus. Approximation theory of the mlp model in neural networks.Acta numerica, 8: 143–195, 1999. 13

  64. [64]

    On efficiently computable functions, deep networks and sparse compositionality

    Tomaso Poggio. On efficiently computable functions, deep networks and sparse compositionality. arXiv preprint arXiv:2510.11942, 2025

  65. [65]

    Compositional sparsity of learnable functions.Bulletin of the American Mathematical Society, 61(3):438–456, 2024

    Tomaso Poggio and Maia Fraser. Compositional sparsity of learnable functions.Bulletin of the American Mathematical Society, 61(3):438–456, 2024

  66. [66]

    Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.International Journal of Automation and Computing, 14(5):503–519, 2017

    Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.International Journal of Automation and Computing, 14(5):503–519, 2017

  67. [67]

    Mecha- nism for feature learning in neural networks and backpropagation-free machine learning models

    Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, and Mikhail Belkin. Mecha- nism for feature learning in neural networks and backpropagation-free machine learning models. Science, 383(6690):1461–1467, 2024

  68. [68]

    Neural Networks With Dense Weights Are Not Universal Approximators

    Levi Rauchwerger, Stefanie Jegelka, and Ron Levie. Dense neural networks are not universal approximators.arXiv preprint arXiv:2602.07618, 2026

  69. [69]

    Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining

    Yunwei Ren, Yatin Dandi, Florent Krzakala, and Jason D Lee. Provable learning of random hierarchy models and hierarchical shallow-to-deep chaining.arXiv preprint arXiv:2601.19756, 2026

  70. [70]

    Nonparametricregressionusingdeepneuralnetworkswithreluactivation function.The Annals of Statistics, 48(4):1875–1897, 2020

    JohannesSchmidt-Hieber. Nonparametricregressionusingdeepneuralnetworkswithreluactivation function.The Annals of Statistics, 48(4):1875–1897, 2020

  71. [71]

    A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1): e2408799121, 2025

    Antonio Sclocchi, Alessandro Favero, and Matthieu Wyart. A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1): e2408799121, 2025

  72. [72]

    Deep network approximation characterized by number of neurons.arXiv preprint arXiv:1906.05497, 2019

    Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons.arXiv preprint arXiv:1906.05497, 2019

  73. [73]

    Approximation and estimation capability of vision transformers for hierarchical compositional models.Applied and Computational Harmonic Analysis, page 101849, 2025

    Zhongjie Shi, Zhiying Fang, and Yuan Cao. Approximation and estimation capability of vision transformers for hierarchical compositional models.Applied and Computational Harmonic Analysis, page 101849, 2025

  74. [74]

    Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks.Foundations of Computational Mathematics, 24(2):481–537, 2024

    Jonathan W Siegel and Jinchao Xu. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks.Foundations of Computational Mathematics, 24(2):481–537, 2024

  75. [75]

    The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70): 1–57, 2018

    Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70): 1–57, 2018

  76. [76]

    Springer Science & Business Media, 2008

    Ingo Steinwart and Andreas Christmann.Support vector machines. Springer Science & Business Media, 2008

  77. [77]

    Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality

    Taiji Suzuki. Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality. InInternational Conference on Learning Representations, 2019

  78. [78]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  79. [79]

    Deep learning and the information bottleneck principle

    Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. Ieee, 2015

  80. [80]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Showing first 80 references.