Approximation Theory for Neural Networks: Old and New

Himasish Talukdar; Soumendu Sundar Mukherjee

arXiv: 2605.21451 · v1 · pith:P4XSS2QQnew · submitted 2026-05-20 · cs.LG · cond-mat.dis-nn · cs.AI · cs.NE

Pith reviewed 2026-05-21 05:35 UTC · model grok-4.3

keywords: neural network approximation, universal approximation, depth-width trade-off, parameter efficiency, Kolmogorov-Arnold networks, quantitative bounds, smoothness, feedforward networks

The pith

Deeper neural networks can approximate structured functions using far fewer parameters than shallow ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews how approximation theory for neural networks progressed from qualitative density results to quantitative bounds on error, network size, and smoothness. It focuses on depth-width trade-offs showing that greater depth improves parameter efficiency when target functions have suitable structure. A reader would care because these findings help explain why deep architectures succeed in practice by linking architecture choices directly to efficiency gains. The paper also surveys recent work on Kolmogorov-Arnold Networks as an alternative paradigm.

Core claim

Universal approximation theorems establish that feedforward neural networks are dense in spaces of continuous functions, $L^p$ spaces, and Sobolev spaces under mild conditions on the activation function. Quantitative extensions show that, for structured target functions, deeper networks reach a given approximation error with fewer parameters than wider but shallower networks.

What carries the argument

Depth-width trade-offs in quantitative approximation bounds relating error to total parameters and smoothness of the target function.
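One textbook way to see this kind of depth advantage is the sawtooth construction usually associated with Telgarsky's depth-separation results. The sketch below is an illustration only, not a construction taken from the survey: the names `tent`, `deep_sawtooth`, and `count_linear_pieces` are hypothetical, and the setting is one-dimensional ReLU networks on [0, 1]. Composing a two-unit ReLU layer `depth` times yields 2^depth linear pieces using only 2·depth hidden units, whereas a one-hidden-layer ReLU network with m units on the real line has at most m + 1 linear pieces.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tent(x):
    # One hidden layer with two ReLU units (illustrative choice of weights).
    # On [0, 1] it rises from 0 to 1 at x = 1/2 and falls back to 0 at x = 1.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x, depth):
    # Compose the same two-unit layer `depth` times: 2 * depth hidden units in total.
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth, grid_exp=12):
    # Assumes depth <= grid_exp, so every breakpoint (a multiple of 2**-depth)
    # lies on the dyadic grid and each grid cell has a single slope.
    x = np.linspace(0.0, 1.0, 2**grid_exp + 1)
    y = deep_sawtooth(x, depth)
    slopes = np.diff(y) / np.diff(x)
    sign_changes = np.sum(np.sign(slopes[1:]) != np.sign(slopes[:-1]))
    return int(sign_changes) + 1


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        pieces = count_linear_pieces(depth)
        print(f"depth={depth}: hidden units={2 * depth}, linear pieces={pieces}")
```

Under these assumptions, matching the piece count of the depth-k network with a single hidden layer would take at least 2^k − 1 ReLU units, which is the exponential gap that depth-width results formalize for suitably structured targets.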

Load-bearing premise

The reviewed results assume target functions possess sufficient smoothness or structure for the stated depth advantages to hold.

What would settle it

A specific structured function class together with explicit network constructions where increasing depth fails to reduce the parameter count needed for a fixed error would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 2605.21451 by Himasish Talukdar, Soumendu Sundar Mukherjee.

**Figure 3.** Schematic of a feedforward neural network with …

**Figure 4.** Schematic of a KAN with $L$ layers: each edge represents a learnable (typically nonlinear) univariate function $\varphi^{(l)}_{j,i}$, and each blue node aggregates by summation. Thus, unlike FNNs, nonlinearities are associated with edges rather than nodes.

Definition 3.2 (The action of a KAN layer). Given an input vector $x = (x_1, \dots, x_{d_{\mathrm{in}}}) \in \mathbb{R}^{d_{\mathrm{in}}}$, a KAN layer $\Phi$ applied to $x$ produces an output $y = (y_1, \dots$ …
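The Figure 4 caption describes a KAN layer as univariate functions on edges followed by summation at each node. A minimal sketch of that reading, with output j computed as the sum over inputs i of an edge function applied to x_i, might look as follows. The function `kan_layer` and the fixed edge functions are hypothetical; in practice the edge functions are learnable (often spline-parameterized), and the truncated Definition 3.2 may include details not reproduced here.

```python
from typing import Callable, Sequence

import numpy as np

Univariate = Callable[[float], float]


def kan_layer(phi: Sequence[Sequence[Univariate]], x: np.ndarray) -> np.ndarray:
    # phi[j][i] is the univariate function on the edge from input i to output j.
    # Each output node only sums the values arriving on its incoming edges.
    return np.array([sum(f(x_i) for f, x_i in zip(row, x)) for row in phi])


# Illustrative two-layer example: the product x1 * x2 written with only
# univariate functions and addition, via x1*x2 = ((x1+x2)^2 - (x1-x2)^2) / 4.
layer1 = [
    [lambda t: t, lambda t: t],    # z1 = x1 + x2
    [lambda t: t, lambda t: -t],   # z2 = x1 - x2
]
layer2 = [
    [lambda t: t * t / 4.0, lambda t: -t * t / 4.0],  # y = z1^2/4 - z2^2/4
]


def product_kan(x: np.ndarray) -> float:
    return float(kan_layer(layer2, kan_layer(layer1, x))[0])


if __name__ == "__main__":
    print(product_kan(np.array([3.0, 5.0])))
```

The product example is chosen only because it is easy to check by hand; it shows how a multivariate interaction can arise from summed univariate edge functions across two layers.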

Abstract

Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of $\mathbb{R}^d$, $L^p$ spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth--width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov--Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.
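To make the link between approximation error and network size concrete, here is an elementary sketch that is not taken from the survey. It writes the piecewise-linear interpolant of a function on [0, 1] as a one-hidden-layer ReLU network with `m` units and measures the sup-norm error. The names `relu_interpolant` and `sup_error` and the test function sin(2πx) are assumptions chosen for illustration. For a twice continuously differentiable target, classical interpolation theory gives an error of at most h²/8 · max|f''| with h = 1/m, so the error should fall roughly like m⁻².

```python
import numpy as np


def relu_interpolant(f, m):
    # Uniform knots t_0 < ... < t_m on [0, 1]; one ReLU unit per knot t_0..t_{m-1}.
    knots = np.linspace(0.0, 1.0, m + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)
    # Outer weight of relu(x - t_k) is the change in slope at t_k.
    coeffs = np.diff(slopes, prepend=0.0)

    def net(x):
        x = np.asarray(x, dtype=float)
        hidden = np.maximum(x[..., None] - knots[:-1], 0.0)  # shape (..., m)
        return values[0] + hidden @ coeffs

    return net


def sup_error(f, m, n_eval=10001):
    # Sup-norm error estimated on a fine evaluation grid.
    xs = np.linspace(0.0, 1.0, n_eval)
    return float(np.max(np.abs(f(xs) - relu_interpolant(f, m)(xs))))


def target(x):
    return np.sin(2.0 * np.pi * x)


if __name__ == "__main__":
    for m in (10, 20, 40, 80):
        bound = np.pi**2 / (2.0 * m**2)  # h^2/8 * max|f''| with max|f''| = 4*pi^2
        print(f"m={m:3d}  sup error={sup_error(target, m):.5f}  bound={bound:.5f}")
```

This only illustrates a shallow, one-dimensional rate under a smoothness assumption; the quantitative results reviewed in the survey concern more general function classes and architectures.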

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey reviewing universal approximation theorems for neural networks, classical density results for single-hidden-layer networks, quantitative approximation rates under smoothness assumptions, depth-width trade-offs showing superior parameter efficiency for deeper architectures on structured function classes, and recent approximation-theoretic results for Kolmogorov-Arnold Networks (KANs).

Significance. If the summaries of existing theorems are accurate, the paper offers a useful compilation of results from classical universal approximation to modern quantitative analyses of depth advantages and alternative architectures. This can serve as a reference for understanding parameter efficiency under Sobolev or compositional smoothness assumptions and for contextualizing KANs within approximation theory.

major comments (1)

[Depth-width trade-offs section] The central claim regarding depth-width trade-offs for structured functions is presented as a summary of prior theorems; however, without explicit cross-references to the precise assumptions (e.g., compositional smoothness vs. Sobolev) in each cited result, it is difficult to assess the scope of the claimed superiority. A dedicated subsection comparing the function classes across key theorems would strengthen the presentation.

minor comments (2)

Notation for network parameters (depth, width, total parameters) is introduced but not always used consistently when comparing bounds across different architectures; a table summarizing the rates would improve clarity.
[KANs section] The discussion of KANs would benefit from explicit mention of any known limitations or open questions in their approximation theory relative to standard feedforward networks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our survey and the constructive recommendation for minor revision. We agree that the depth-width trade-offs section can be strengthened by making the underlying function-class assumptions more explicit.

Referee: [Depth-width trade-offs section] The central claim regarding depth-width trade-offs for structured functions is presented as a summary of prior theorems; however, without explicit cross-references to the precise assumptions (e.g., compositional smoothness vs. Sobolev) in each cited result, it is difficult to assess the scope of the claimed superiority. A dedicated subsection comparing the function classes across key theorems would strengthen the presentation.

Authors: We thank the referee for this observation. We agree that the manuscript would benefit from greater transparency regarding the assumptions in the cited depth-width results. In the revised version we will add a dedicated subsection (provisionally titled 'Comparison of Function Classes in Depth-Width Trade-offs') that systematically lists the smoothness and structural hypotheses (compositional smoothness, Sobolev regularity, etc.) for each key theorem, supplies explicit cross-references to the original statements, and includes a concise comparison table. This addition will clarify the scope of the claimed parameter-efficiency advantages without altering any of the existing theorems or proofs.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in literature survey

This manuscript is a survey paper that compiles and reviews classical and recent external results on neural network approximation theory, universal approximation theorems, depth-width trade-offs, and Kolmogorov-Arnold networks. No original derivations, quantitative predictions, or fitted parameters are introduced by the authors. All central claims about parameter efficiency for structured function classes are explicitly presented as summaries of existing theorems from the literature (under standard smoothness assumptions), with no self-referential reductions or load-bearing self-citations that collapse the argument to the paper's own inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey, the paper introduces no new free parameters, axioms, or invented entities; it aggregates existing theory from the literature.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Universal approximation theorems... depth–width trade-offs... Kolmogorov–Arnold Networks

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Barron norm... $O(m^{-1/2})$ rates... $(p,C)$-smooth functions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

[1] Arnol'd, V. I. (1957). On functions of three variables. In Doklady Akademii Nauk, volume 114, pages 679–681. Russian Academy of Sciences.
[2] Arnol'd, V. I. (1959). On the representation of continuous functions of three variables by superpositions of continuous functions of two variables. Matematicheskii Sbornik, 90(1):3–74.
[3] Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945.
[4] Baum, E. B. (1988). On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3):193–215.
[5] Bianchini, M. and Scarselli, F. (2014). On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565.
[6] Cai, Y. (2023). Achieve the Minimum Width of Neural Networks for Universal Approximation. In The Eleventh International Conference on Learning Representations.
[7] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303–314.
[8] Daubechies, I., DeVore, R., Foucart, S., Hanin, B., and Petrova, G. (2022). Nonlinear approximation and (deep) ReLU networks. Constructive Approximation, 55(1):127–172.
[9] De Boor, C. (1978). A practical guide to splines, volume 27. Springer-Verlag New York.
[10] De Ryck, T., Lanthaler, S., and Mishra, S. (2021). On the approximation of functions by tanh neural networks. Neural Networks, 143:732–750.
[11] DeVore, R., Hanin, B., and Petrova, G. (2021). Neural network approximation. Acta Numerica, 30:327–444.
[12] E, W., Ma, C., and Wu, L. (2021). The Barron Space and the Flow-Induced Function Spaces for Neural Network Models. Constructive Approximation, 55(1):369–406.
[13] Eldan, R. and Shamir, O. (2016). The Power of Depth for Feedforward Neural Networks. In Feldman, V., Rakhlin, A., and Shamir, O., editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 907–940, Columbia University, New York, New York, USA. PMLR.
[14] Funahashi, K.-I. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192.
[15] Girosi, F. and Poggio, T. (1989). Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Computation, 1(4):465–469.
[16] Hanin, B. and Sellke, M. (2017). Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278.
[17] Hecht-Nielsen, R. (1987). Kolmogorov's mapping neural network existence theorem. In Proceedings of the International Conference on Neural Networks, volume 3, pages 11–14. IEEE Press, New York, NY, USA.
[18] Hilbert, D. (1902). Mathematical problems. Bulletin of the American Mathematical Society, 8(10):437–479.
[19] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.
[20] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.
[21] Hwang, G. (2025). Optimal Minimum Width for the Universal Approximation of Continuously Differentiable Functions by Deep Narrow MLPs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[22] Igelnik, B. and Parikh, N. (2003). Kolmogorov's spline network. IEEE Transactions on Neural Networks, 14(4):725–733.
[23] Johnson, J. (2019). Deep, Skinny Neural Networks are not Universal Approximators. In International Conference on Learning Representations.
[24] Kidger, P. and Lyons, T. (2020). Universal approximation with deep narrow networks. In Conference on Learning Theory, pages 2306–2327. PMLR.
[25] Kim, N., Min, C., and Park, S. (2024). Minimum width for universal approximation using ReLU networks on compact domain. In The Twelfth International Conference on Learning Representations.
[26] Klusowski, J. M. and Barron, A. R. (2016). Risk bounds for high-dimensional ridge function combinations including neural networks. arXiv preprint arXiv:1607.01434.
[27] Kohler, M. and Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249.
[28] Kolmogorov, A. N. (1956). On the representation of continuous functions of several variables by superpositions of continuous functions of a smaller number of variables. In Dokl. Akad. Nauk USSR, volume 108, pages 179–192.
[29] Kolmogorov, A. N. (1957). On the representations of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Dokl. Akad. Nauk USSR, volume 114, pages 953–956.
[30] Kratsios, A. and Furuya, T. (2025). Kolmogorov-Arnold Networks: Approximation and learning guarantees for functions and their derivatives. arXiv preprint.
[31] Kůrková, V. (1991). Kolmogorov's theorem is relevant. Neural Computation, 3(4):617–622.
[32] Kůrková, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3):501–506.
[33] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
[34] Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.
[35] Li, L., Duan, Y., Ji, G., and Cai, Y. (2023). Minimum Width of Leaky-ReLU Neural Networks for Uniform Universal Approximation. In Proceedings of the 40th International Conference on Machine Learning, pages 19460–19470. PMLR.
[36] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljacic, M., Hou, T. Y., and Tegmark, M. (2025). KAN: Kolmogorov-Arnold Networks. In The Thirteenth International Conference on Learning Representations.
[37] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural networks: A view from the width. Advances in Neural Information Processing Systems, 30.
[38] Makovoz, Y. (1996). Random Approximants and Neural Networks. Journal of Approximation Theory, 85(1):98–109.
[39] McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133.
[40] Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1):164–177.
[41] Montúfar, G., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems, 27.
[42] Morris, S. (2021). Hilbert 13: Are there any genuine continuous multivariate real-valued functions? Bulletin of the American Mathematical Society, 58(1):107–118.
[43] Park, S., Yun, C., Lee, J., and Shin, J. (2021). Minimum Width for Universal Approximation. In International Conference on Learning Representations.
[44] Petersen, P. and Voigtlaender, F. (2018). Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330.
[45] Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195.
[46] Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. Advances in Neural Information Processing Systems, 29.
[47] Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. (2017). On the Expressive Power of Deep Neural Networks. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2847–2854. PMLR.
[48] Rochau, D., Chan, R., and Gottschalk, H. (2024). New advances in universal approximation with neural networks of minimal width. arXiv preprint arXiv:2411.08735.
[49] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.
[50] Scarselli, F. and Chung Tsoi, A. (1998). Universal Approximation Using Feedforward Neural Networks: A Survey of Some Existing Methods, and Some New Results. Neural Networks, 11(1):15–37.
[51] Schumaker, L. (2007). Spline functions: basic theory. Cambridge University Press.
[52] Schwartz, L. (1944). Sur certaines familles non fondamentales de fonctions continues. Bulletin de la Société Mathématique de France, 72:141–145.
[53] Siegel, J. W. (2023). Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. Journal of Machine Learning Research, 24(357):1–52.
[54] Siegel, J. W. and Xu, J. (2022). High-order approximation rates for shallow neural networks with cosine and ReLU$^k$ activation functions. Applied and Computational Harmonic Analysis, 58:1–26.
[55] Telgarsky, M. (2015). Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101.
[56] Telgarsky, M. (2016). Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517–1539. PMLR.
[57] Telgarsky, M. (2021). Deep learning theory lecture notes. https://mjt.cs.illinois.edu/dlt/. Version: 2021-10-27 v0.0-e7150f2d (alpha).
[58] Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114.
[59] Yarotsky, D. (2018). Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639–649. PMLR.
