Learning Sparse Compositional Functions with Norm-Constrained Neural Networks
Pith reviewed 2026-06-29 20:30 UTC · model grok-4.3
The pith
Frobenius norm-constrained deep neural networks achieve approximation rates and excess risk bounds that avoid the curse of dimensionality for sparse compositional functions whose structure is represented by DAGs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the curse of dimensionality (CoD) through hierarchical representations.
What carries the argument
Frobenius norm-constrained deep neural networks applied to DAG representations of sparse compositional structure.
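To make the constraint concrete, here is a minimal sketch that computes a Frobenius norm over the weight matrices of a small ReLU network and checks it against a budget. The way the norm is aggregated across layers (square root of the sum of all squared entries), the helper names `frobenius_norm`, `in_norm_ball`, and `relu_forward`, and the toy weights are assumptions made for illustration; the paper's exact constraint may be defined differently.

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_norm(layers: List[Matrix]) -> float:
    """Square root of the sum of squared entries over all weight matrices."""
    return math.sqrt(sum(w * w for W in layers for row in W for w in row))


def in_norm_ball(layers: List[Matrix], budget: float) -> bool:
    """True if the parameters lie inside the (assumed) Frobenius ball."""
    return frobenius_norm(layers) <= budget


def relu_forward(layers: List[Matrix], x: List[float]) -> List[float]:
    """Bias-free ReLU network: ReLU after every layer except the last."""
    h = list(x)
    for i, W in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h


# Hypothetical two-layer network with toy weights.
layers = [
    [[3.0, 0.0], [0.0, 4.0]],  # first layer, 2 x 2
    [[1.0, 1.0]],              # output layer, 1 x 2
]

print(frobenius_norm(layers))            # sqrt(9 + 16 + 1 + 1) = sqrt(27)
print(in_norm_ball(layers, budget=6.0))  # True
print(relu_forward(layers, [1.0, -1.0]))
```

The point of the sketch is that the budget depends only on the size of the weights, not on how many of them there are, which is what lets a norm-based measure stay meaningful when the parameter count exceeds the sample size.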
If this is right
- Deep networks exploit the compositional structure of target functions to avoid the curse of dimensionality.
- The framework applies to multi-index models, binary tree structures, and general compositional architectures.
- Every efficiently Turing computable function admits sparse compositional representations via DAGs.
- The norm-based complexity measure produces non-vacuous bounds when the number of parameters exceeds the sample size.
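As an illustration of the binary tree case in the list above, the following sketch evaluates a function on four inputs as a DAG in which every node reads only two values. The node functions, the names `evaluate_dag`, `nodes`, and `order`, and the reading of "sparse" as bounded fan-in are hypothetical choices for this example, not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Each internal node is (function, names of the nodes it reads from).
Node = Tuple[Callable[..., float], List[str]]


def evaluate_dag(
    nodes: Dict[str, Node],
    order: List[str],
    inputs: Dict[str, float],
    output: str,
) -> float:
    """Evaluate a compositional function node by node.

    `order` must be a topological order of the internal nodes, so that
    every parent value is available before it is needed.
    """
    values = dict(inputs)
    for name in order:
        fn, parents = nodes[name]
        values[name] = fn(*(values[p] for p in parents))
    return values[output]


# Hypothetical binary tree: f(x) = h3(h1(x1, x2), h2(x3, x4)).
nodes: Dict[str, Node] = {
    "h1": (lambda a, b: a * b, ["x1", "x2"]),
    "h2": (lambda a, b: a + b, ["x3", "x4"]),
    "h3": (lambda a, b: max(a, b), ["h1", "h2"]),
}
order = ["h1", "h2", "h3"]
x = {"x1": 2.0, "x2": 3.0, "x3": 1.0, "x4": 4.0}

value = evaluate_dag(nodes, order, x, output="h3")
max_fan_in = max(len(parents) for _, parents in nodes.values())
print(value, max_fan_in)
```

Although the overall function has four inputs, each constituent is only bivariate. That low fan-in is the structure a deep network could exploit by approximating the constituents one level at a time.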
Where Pith is reading between the lines
- Regularization that explicitly controls Frobenius norm during training may be especially effective for tasks with hidden compositional structure.
- The DAG representation could be relaxed to allow approximate or noisy compositional graphs while retaining similar rates.
- Similar norm-based analysis might extend to recurrent or attention-based architectures that also process hierarchical data.
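The first point above mentions regularization that controls the Frobenius norm during training. One familiar soft version is weight decay: adding (lam / 2) * ||W||_F^2 to the loss. The sketch below shows a single gradient step under that penalty. It is an assumption-laden illustration (the names `weight_decay_step` and `squared_frobenius`, the learning rate, and the penalty strength are all hypothetical), and a penalty is not the same as the hard norm constraint analyzed in the paper.

```python
from typing import List

Matrix = List[List[float]]


def weight_decay_step(W: Matrix, grad: Matrix, lr: float, lam: float) -> Matrix:
    """One gradient step on loss(W) + (lam / 2) * ||W||_F^2.

    The penalty adds lam * W to the gradient, so the update is
    W - lr * (grad + lam * W).
    """
    return [
        [w - lr * (g + lam * w) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(W, grad)
    ]


def squared_frobenius(W: Matrix) -> float:
    """Sum of squared entries of a single weight matrix."""
    return sum(w * w for row in W for w in row)


W = [[1.0, -2.0], [0.5, 0.0]]
zero_grad = [[0.0, 0.0], [0.0, 0.0]]

# With a zero data gradient the step only shrinks W by the factor (1 - lr * lam).
W_next = weight_decay_step(W, zero_grad, lr=0.1, lam=0.5)
print(squared_frobenius(W), squared_frobenius(W_next))
```

With no data gradient the weights contract toward zero, which is why the penalty acts as a soft pressure on the Frobenius norm rather than a guarantee that it stays below a fixed budget.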
Load-bearing premise
The target functions admit sparse compositional representations via DAGs and the Frobenius norm of network parameters provides an appropriate complexity measure that yields non-vacuous bounds in the overparameterized regime.
What would settle it
A concrete sparse compositional function on a DAG for which the approximation rate or excess risk bound fails to improve over unstructured high-dimensional learning when the network is constrained only by Frobenius norm.
Original abstract
The ability of deep neural networks to learn hierarchical features is widely regarded as a key mechanism underlying their success in high-dimensional learning. Existing theory partially supports this view by establishing approximation rates based on parameter counts and sample complexity guarantees for compositional models without incurring the curse of dimensionality (CoD). To study overparameterized regimes, where the number of parameters exceeds the sample size, we develop a framework that measures complexity via the parameter norm. Within this approach, we establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the CoD through hierarchical representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a norm-based complexity framework for overparameterized deep networks and derives approximation rates together with excess risk bounds for sparse compositional target functions whose structure is encoded by directed acyclic graphs (DAGs). The bounds are obtained for Frobenius-norm-constrained networks and are claimed to hold for any efficiently Turing-computable function, thereby covering multi-index models, binary-tree compositions, and general hierarchical architectures while avoiding the curse of dimensionality.
Significance. If the stated rates and bounds are valid, the work supplies a concrete theoretical account of how norm constraints can control complexity in the overparameterized regime and how DAG-structured compositional representations permit dimension-free learning. The explicit link to Turing-computable functions broadens the scope beyond the usual hand-crafted compositional examples and supplies a unified treatment of several standard model classes.
Minor comments (2)
- The abstract states that the rates 'show that deep networks can exploit the compositional structure,' yet the precise dependence of the constants on the DAG depth, width, and sparsity parameters is not summarized; a short table or corollary collecting the leading terms would improve readability.
- Notation for the Frobenius-norm ball and the DAG-induced function class is introduced without an explicit comparison to the more common spectral-norm or path-norm constraints used in related compositional analyses; a brief remark on why the Frobenius choice yields non-vacuous bounds would clarify the contribution.
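For reference alongside the second comment, the standard relation between the Frobenius and spectral norms of a single weight matrix W can be written as follows. This is textbook linear algebra rather than a result from the paper, and how the norms are aggregated across layers is deliberately left open here.

```latex
\[
  \|W\|_F = \Bigl(\sum_{i,j} W_{ij}^2\Bigr)^{1/2},
  \qquad
  \|W\|_2 = \sigma_{\max}(W),
  \qquad
  \|W\|_2 \;\le\; \|W\|_F \;\le\; \sqrt{\operatorname{rank}(W)}\,\|W\|_2 .
\]
```

A Frobenius bound therefore implies a spectral bound of the same size, while the converse can lose a factor of up to the square root of the rank.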
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear under the MAJOR COMMENTS section of the report.
Circularity Check
No significant circularity detected
Full rationale
The derivation establishes approximation rates and excess risk bounds directly from the Frobenius norm constraint applied to networks representing DAG-structured sparse compositional functions. These bounds follow from standard norm-based complexity measures on the assumed target class without any reduction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The claim that efficiently Turing-computable functions admit such representations serves as broad motivation rather than a circular premise in the core bounds. The argument remains self-contained against external benchmarks for compositional approximation theory.
Forward citations
Cited by 1 Pith paper
- Representation Costs in Data Science: Foundations and the Quasi-Banach Spaces of Deep Neural Networks
  Develops a general framework for representation costs of parametric models, proving that depth-L ReLU networks induce p-normable quasi-Banach spaces with p = 2/L.