arxiv: 2605.21451 · v1 · pith:P4XSS2QQnew · submitted 2026-05-20 · 💻 cs.LG · cond-mat.dis-nn · cs.AI · cs.NE

Approximation Theory for Neural Networks: Old and New

Pith reviewed 2026-05-21 05:35 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · cs.AI · cs.NE
keywords neural network approximation · universal approximation · depth-width trade-off · parameter efficiency · Kolmogorov-Arnold networks · quantitative bounds · smoothness · feedforward networks

The pith

Deeper neural networks can approximate structured functions using far fewer parameters than shallow ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews how approximation theory for neural networks progressed from qualitative density results to quantitative bounds on error, network size, and smoothness. It focuses on depth-width trade-offs showing that greater depth improves parameter efficiency when target functions have suitable structure. A reader would care because these findings help explain why deep architectures succeed in practice by linking architecture choices directly to efficiency gains. The paper also surveys recent work on Kolmogorov-Arnold Networks as an alternative paradigm.

Core claim

Universal approximation theorems establish that, under mild conditions on the activation function, feedforward neural networks are dense in spaces of continuous functions on compact sets, $L^p$ spaces, and Sobolev spaces. Quantitative extensions show that, for structured target functions, deeper networks reach a given approximation error with fewer parameters than wider but shallower networks.
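A standard illustration of this kind of depth advantage is the sawtooth construction usually associated with Telgarsky's depth-separation results. The sketch below is not taken from the survey. It composes a two-unit ReLU "tent" layer `depth` times and counts the linear pieces of the result. The helper names (`tent`, `deep_sawtooth`, `count_linear_pieces`) are illustrative assumptions, and the piece counter assumes that every breakpoint lies on the evaluation grid.

```python
def relu(z: float) -> float:
    return z if z > 0.0 else 0.0


def tent(x: float) -> float:
    """One hidden layer with two ReLU units: the tent map on [0, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x: float, depth: int) -> float:
    """Compose the tent map `depth` times (a network with 2 * depth ReLU units)."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(f, n_cells: int) -> int:
    """Count linear pieces of f on [0, 1] from slope changes on a uniform grid.

    Assumes every breakpoint of f lies on the grid points i / n_cells.
    """
    ys = [f(i / n_cells) for i in range(n_cells + 1)]
    slopes = [(ys[i + 1] - ys[i]) * n_cells for i in range(n_cells)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if abs(a - b) > 1e-9)


if __name__ == "__main__":
    for depth in range(1, 7):
        # Breakpoints sit at multiples of 2 ** -depth, so this grid contains them all.
        pieces = count_linear_pieces(lambda x: deep_sawtooth(x, depth), 2 ** (depth + 2))
        print(f"depth={depth}  relu_units={2 * depth}  linear_pieces={pieces}")
```

The number of pieces doubles with each extra layer while the unit count grows linearly. A one-hidden-layer ReLU network on the real line with m units has at most m + 1 linear pieces, so representing the depth-k sawtooth exactly with a single hidden layer needs at least 2^k - 1 units. This counting argument covers exact representation only; the approximation-error statements in the literature need additional work.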

What carries the argument

Depth-width trade-offs in quantitative approximation bounds relating error to total parameters and smoothness of the target function.

Load-bearing premise

The reviewed results assume target functions possess sufficient smoothness or structure for the stated depth advantages to hold.

What would settle it

A specific structured function class together with explicit network constructions where increasing depth fails to reduce the parameter count needed for a fixed error would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 2605.21451 by Himasish Talukdar, Soumendu Sundar Mukherjee.

Figure 2: A multilayer perceptron representing XOR.
Figure 3: Schematic of a feedforward neural network with …
Figure 4: Schematic of a KAN with $L$ layers: each edge represents a learnable (typically nonlinear) univariate function $\varphi^{(l)}_{j,i}$, and each blue node aggregates by summation. Thus, unlike FNNs, nonlinearities are associated with edges rather than nodes.
Definition 3.2 (The action of a KAN layer). Given an input vector $x = (x_1, \dots, x_{d_{\mathrm{in}}}) \in \mathbb{R}^{d_{\mathrm{in}}}$, a KAN layer $\Phi$ applied to $x$ produces an output $y = (y_1, \dots$ …
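The edge-based structure described in the Figure 4 caption can be sketched in a few lines. The definition is truncated here, so the formula used below, y[j] = sum over i of phi[j][i](x[i]), is an assumption: it is the usual reading of "each edge carries a univariate function and each node sums". The name `kan_layer` and the hand-picked edge functions are hypothetical; in an actual KAN the edge functions would be learnable, for example spline-parameterized.

```python
from typing import Callable, List, Sequence

Univariate = Callable[[float], float]


def kan_layer(phis: Sequence[Sequence[Univariate]], x: Sequence[float]) -> List[float]:
    """Apply one KAN-style layer: y[j] = sum_i phis[j][i](x[i]).

    phis[j][i] is the univariate function on the edge from input i to output j.
    """
    for row in phis:
        if len(row) != len(x):
            raise ValueError("each output node needs one edge function per input")
    # Nonlinearities live on the edges; each output node only sums its incoming edges.
    return [sum(phi(xi) for phi, xi in zip(row, x)) for row in phis]


# Hypothetical fixed edge functions, chosen only for illustration.
example_phis = [
    [lambda t: t * t, lambda t: 2.0 * t],  # edges into output node 0
    [abs, lambda t: t + 1.0],              # edges into output node 1
]

print(kan_layer(example_phis, [3.0, -1.0]))  # prints [7.0, 3.0]
```

By contrast, a standard feedforward layer applies a linear map across all inputs and then a fixed nonlinearity at each node.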
Original abstract

Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of $\mathbb{R}^d$, $L^p$ spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth--width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov--Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.
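In one dimension with ReLU activation, the density statement can be made concrete by an explicit construction: a single-hidden-layer network that interpolates the target at uniform knots. The sketch below is a special case chosen for illustration, not the proof technique of the classical theorems, and the names `build_shallow_relu`, `evaluate`, `sup_error`, and `target` are assumptions introduced here.

```python
import math
from typing import Callable, List, Tuple

Net = Tuple[float, List[float], List[float]]


def build_shallow_relu(f: Callable[[float], float], n: int) -> Net:
    """Return (bias, knots, coeffs) of g(x) = bias + sum_i coeffs[i] * relu(x - knots[i]).

    g is piecewise linear and interpolates f at the n + 1 uniform knots i / n on [0, 1].
    """
    xs = [i / n for i in range(n + 1)]
    ys = [f(x) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) * n for i in range(n)]
    # Each hidden unit contributes the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n)]
    return ys[0], xs[:n], coeffs


def evaluate(net: Net, x: float) -> float:
    bias, knots, coeffs = net
    return bias + sum(c * max(0.0, x - k) for c, k in zip(coeffs, knots))


def sup_error(f: Callable[[float], float], net: Net, n_test: int = 1000) -> float:
    """Largest error on a uniform test grid (a proxy for the sup-norm error)."""
    return max(abs(f(i / n_test) - evaluate(net, i / n_test)) for i in range(n_test + 1))


def target(x: float) -> float:
    return math.sin(2.0 * math.pi * x)


if __name__ == "__main__":
    for n in (4, 16, 64):
        err = sup_error(target, build_shallow_relu(target, n))
        print(f"hidden_units={n}  sup_error={err:.5f}")
```

For a twice continuously differentiable target, the interpolation error is at most h^2/8 times the maximum of |f''|, with h = 1/n, so the error shrinks as the width grows. This is a simple instance of the quantitative bounds the abstract mentions, which tie error to network size and smoothness.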

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey reviewing universal approximation theorems for neural networks, classical density results for single-hidden-layer networks, quantitative approximation rates under smoothness assumptions, depth-width trade-offs showing superior parameter efficiency for deeper architectures on structured function classes, and recent approximation-theoretic results for Kolmogorov-Arnold Networks (KANs).

Significance. If the summaries of existing theorems are accurate, the paper offers a useful compilation of results from classical universal approximation to modern quantitative analyses of depth advantages and alternative architectures. This can serve as a reference for understanding parameter efficiency under Sobolev or compositional smoothness assumptions and for contextualizing KANs within approximation theory.

major comments (1)
  1. [Depth-width trade-offs section] The central claim regarding depth-width trade-offs for structured functions is presented as a summary of prior theorems; however, without explicit cross-references to the precise assumptions (e.g., compositional smoothness vs. Sobolev) in each cited result, it is difficult to assess the scope of the claimed superiority. A dedicated subsection comparing the function classes across key theorems would strengthen the presentation.
minor comments (2)
  1. Notation for network parameters (depth, width, total parameters) is introduced but not always used consistently when comparing bounds across different architectures; a table summarizing the rates would improve clarity.
  2. [KANs section] The discussion of KANs would benefit from explicit mention of any known limitations or open questions in their approximation theory relative to standard feedforward networks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our survey and the constructive recommendation for minor revision. We agree that the depth-width trade-offs section can be strengthened by making the underlying function-class assumptions more explicit.

Point-by-point responses
  1. Referee: [Depth-width trade-offs section] The central claim regarding depth-width trade-offs for structured functions is presented as a summary of prior theorems; however, without explicit cross-references to the precise assumptions (e.g., compositional smoothness vs. Sobolev) in each cited result, it is difficult to assess the scope of the claimed superiority. A dedicated subsection comparing the function classes across key theorems would strengthen the presentation.

    Authors: We thank the referee for this observation. We agree that the manuscript would benefit from greater transparency regarding the assumptions in the cited depth-width results. In the revised version we will add a dedicated subsection (provisionally titled 'Comparison of Function Classes in Depth-Width Trade-offs') that systematically lists the smoothness and structural hypotheses (compositional smoothness, Sobolev regularity, etc.) for each key theorem, supplies explicit cross-references to the original statements, and includes a concise comparison table. This addition will clarify the scope of the claimed parameter-efficiency advantages without altering any of the existing theorems or proofs.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in literature survey

This manuscript is a survey paper that compiles and reviews classical and recent external results on neural network approximation theory, universal approximation theorems, depth-width trade-offs, and Kolmogorov-Arnold networks. No original derivations, quantitative predictions, or fitted parameters are introduced by the authors. All central claims about parameter efficiency for structured function classes are explicitly presented as summaries of existing theorems from the literature (under standard smoothness assumptions), with no self-referential reductions or load-bearing self-citations that collapse the argument to the paper's own inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey, the paper introduces no new free parameters, axioms, or invented entities; it aggregates existing theory from the literature.

pith-pipeline@v0.9.0 · 5720 in / 938 out tokens · 26413 ms · 2026-05-21T05:35:22.934027+00:00 · methodology
