Approximation Theory for Neural Networks: Old and New
Pith reviewed 2026-05-21 05:35 UTC · model grok-4.3
The pith
Deeper neural networks can approximate structured functions using far fewer parameters than shallow ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Universal approximation theorems establish that, under mild conditions on the activation function, feedforward neural networks are dense in continuous functions on compact sets, L^p spaces, and Sobolev spaces. Quantitative extensions show that, for structured target functions, deeper networks reach a given approximation error with fewer parameters than wider but shallower networks.
What carries the argument
Depth-width trade-offs in quantitative approximation bounds relating error to total parameters and smoothness of the target function.
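As a concrete illustration of a depth-width trade-off, the following minimal sketch uses the classical sawtooth construction: composing a two-unit ReLU "tent" map with itself. It is not taken from the reviewed paper, and the function names (`tent`, `deep_sawtooth`, `count_linear_pieces`) are hypothetical. It relies on the standard fact that a one-hidden-layer ReLU network with m units on the real line has at most m + 1 linear pieces, and it concerns exact representation rather than the approximation rates the survey discusses.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tent(x):
    """Tent map on [0, 1] built from two ReLU units: 2x on [0, 1/2], 2 - 2x on [1/2, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x, depth):
    """Compose the tent map `depth` times (2 * depth ReLU units in total)."""
    y = x
    for _ in range(depth):
        y = tent(y)
    return y


def count_linear_pieces(depth):
    """Count linear pieces of the depth-fold composition on [0, 1].

    Breakpoints sit at multiples of 2**-depth, so a dyadic grid with step
    2**-(depth + 2) places every breakpoint exactly on a grid node.
    """
    n = 2 ** (depth + 2)
    x = np.linspace(0.0, 1.0, n + 1)
    y = deep_sawtooth(x, depth)
    slopes = np.diff(y) / np.diff(x)
    # A new piece starts wherever the slope changes between neighbouring cells.
    changes = np.count_nonzero(np.abs(np.diff(slopes)) > 1e-6)
    return changes + 1


if __name__ == "__main__":
    for depth in (1, 3, 5, 10):
        pieces = count_linear_pieces(depth)
        deep_units = 2 * depth
        # One hidden layer with m ReLU units gives at most m + 1 pieces in 1D,
        # so matching `pieces` exactly needs at least pieces - 1 units.
        shallow_units = pieces - 1
        print(depth, deep_units, pieces, shallow_units)
```

At depth 10 the composed network uses 20 ReLU units and has 1024 linear pieces, whereas a single hidden layer would need at least 1023 units to reproduce the same function exactly. The gap grows exponentially in depth for this particular function; it says nothing by itself about generic targets.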
Load-bearing premise
The reviewed results assume target functions possess sufficient smoothness or structure for the stated depth advantages to hold.
What would settle it
A specific structured function class together with explicit network constructions where increasing depth fails to reduce the parameter count needed for a fixed error would falsify the efficiency claim.
Original abstract
Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of $\mathbb{R}^d$, $L^p$ spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth--width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov--Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.
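For readers who want the density claim in symbols, one common formulation for single-hidden-layer networks is sketched below. The specific hypotheses are an assumption made here for illustration (a compact set $K \subset \mathbb{R}^d$ and a continuous, non-polynomial activation $\sigma$); the survey's own statements may use different conditions.

```latex
\[
\forall f \in C(K),\ \forall \varepsilon > 0,\ 
\exists N \in \mathbb{N},\ a_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^d :
\quad
\sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{N} a_i \, \sigma(w_i \cdot x + b_i) \Bigr| < \varepsilon .
\]
```

The statement is purely qualitative: it guarantees that some width $N$ suffices but gives no bound on $N$ in terms of $\varepsilon$, which is the gap the quantitative theory addresses.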
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey reviewing universal approximation theorems for neural networks, classical density results for single-hidden-layer networks, quantitative approximation rates under smoothness assumptions, depth-width trade-offs showing superior parameter efficiency for deeper architectures on structured function classes, and recent approximation-theoretic results for Kolmogorov-Arnold Networks (KANs).
Significance. If the summaries of existing theorems are accurate, the paper offers a useful compilation of results from classical universal approximation to modern quantitative analyses of depth advantages and alternative architectures. This can serve as a reference for understanding parameter efficiency under Sobolev or compositional smoothness assumptions and for contextualizing KANs within approximation theory.
major comments (1)
- [Depth-width trade-offs section] The central claim regarding depth-width trade-offs for structured functions is presented as a summary of prior theorems; however, without explicit cross-references to the precise assumptions (e.g., compositional smoothness vs. Sobolev) in each cited result, it is difficult to assess the scope of the claimed superiority. A dedicated subsection comparing the function classes across key theorems would strengthen the presentation.
minor comments (2)
- Notation for network parameters (depth, width, total parameters) is introduced but not always used consistently when comparing bounds across different architectures; a table summarizing the rates would improve clarity.
- [KANs section] The discussion of KANs would benefit from explicit mention of any known limitations or open questions in their approximation theory relative to standard feedforward networks.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our survey and the constructive recommendation for minor revision. We agree that the depth-width trade-offs section can be strengthened by making the underlying function-class assumptions more explicit.
Point-by-point responses
- Referee: [Depth-width trade-offs section] The central claim regarding depth-width trade-offs for structured functions is presented as a summary of prior theorems; however, without explicit cross-references to the precise assumptions (e.g., compositional smoothness vs. Sobolev) in each cited result, it is difficult to assess the scope of the claimed superiority. A dedicated subsection comparing the function classes across key theorems would strengthen the presentation.
  Authors: We thank the referee for this observation. We agree that the manuscript would benefit from greater transparency regarding the assumptions in the cited depth-width results. In the revised version we will add a dedicated subsection (provisionally titled 'Comparison of Function Classes in Depth-Width Trade-offs') that systematically lists the smoothness and structural hypotheses (compositional smoothness, Sobolev regularity, etc.) for each key theorem, supplies explicit cross-references to the original statements, and includes a concise comparison table. This addition will clarify the scope of the claimed parameter-efficiency advantages without altering any of the existing theorems or proofs.
  Revision: yes
Circularity Check
No significant circularity in literature survey
This manuscript is a survey paper that compiles and reviews classical and recent external results on neural network approximation theory, universal approximation theorems, depth-width trade-offs, and Kolmogorov-Arnold networks. No original derivations, quantitative predictions, or fitted parameters are introduced by the authors. All central claims about parameter efficiency for structured function classes are explicitly presented as summaries of existing theorems from the literature (under standard smoothness assumptions), with no self-referential reductions or load-bearing self-citations that collapse the argument to the paper's own inputs. The survey's claims therefore rest on external results and can be checked against them independently.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Universal approximation theorems... depth–width trade-offs... Kolmogorov–Arnold Networks
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Barron norm... O(m^{-1/2}) rates... (p,C)-smooth functions
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.