
arxiv: 2604.20219 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.NA · math.NA · stat.ML

Geometric Layer-wise Approximation Rates for Deep Networks

Pith reviewed 2026-05-10 01:22 UTC · model grok-4.3

classification 💻 cs.LG cs.NA math.NA stat.ML
keywords deep neural networks, approximation rates, modulus of continuity, layer-wise approximation, geometric convergence, mixed activation functions, multigrade learning

The pith

A fixed-width deep network with mixed activations makes every intermediate readout an approximant to any L^p function, with error controlled at successively finer scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs one neural network architecture of width 2dN+d+2 that can be extended to any finite depth while ensuring that each partial output Phi_ell, read out after ell layers, approximates the target with an error of at most (2d+1) times the L^p modulus of continuity evaluated at scale N^{-ell}. This gives depth a concrete meaning as successive correction of residuals at geometrically smaller scales, without discarding earlier layers. For 1-Lipschitz targets the bound simplifies to the geometric rate (2d+1)N^{-ell}. The result matters because it supplies a quantitative account of why adding layers improves approximation even when width stays fixed, and it supports adaptive refinement by keeping all prior corrections inside later readouts.

Core claim

We design a single shared mixed-activation architecture of fixed width 2dN+d+2 and any prescribed finite depth such that each intermediate readout Phi_ell is itself an approximant to the target function f. For f in L^p([0,1]^d) with p in [1, infinity), the approximation error of Phi_ell is controlled by (2d+1) times the L^p modulus of continuity at the geometric scale N^{-ell} for all ell. The estimate reduces to the geometric rate (2d+1)N^{-ell} if f is 1-Lipschitz.
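
As a quick arithmetic check of the quantities quoted above, the following sketch evaluates the width 2dN+d+2 and the 1-Lipschitz rate (2d+1)N^{-ell}. The function names `network_width` and `lipschitz_rate` are illustrative only and do not come from the paper.

```python
def network_width(d: int, N: int) -> int:
    """Width 2dN + d + 2 quoted in the core claim."""
    return 2 * d * N + d + 2


def lipschitz_rate(d: int, N: int, ell: int) -> float:
    """Bound (2d+1) * N**(-ell) quoted for 1-Lipschitz targets."""
    return (2 * d + 1) * float(N) ** (-ell)


if __name__ == "__main__":
    d, N = 1, 2
    print("width:", network_width(d, N))
    for ell in range(4):
        # Each extra layer divides the bound by N.
        print(ell, lipschitz_rate(d, N, ell))
```

The width depends on d and N but not on the depth, while the bound shrinks by a factor N per layer.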

What carries the argument

The mixed-activation network with nested intermediate readouts Phi_ell that accumulate corrections at successive geometric scales N^{-ell} while preserving all earlier terms in later outputs.

If this is right

  • Each added layer refines the approximation at a strictly smaller geometric scale without altering earlier layers.
  • The same fixed-width network serves as a valid approximant at every depth, enabling depth to act as a progressive refinement parameter.
  • For Lipschitz functions the error contracts geometrically with depth at a rate independent of the particular function beyond its Lipschitz constant.
  • The construction supports multigrade learning in which each new layer targets only the residual information left at finer scales.
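
The residual-correction reading in the bullets above can be illustrated in one dimension with a stand-in that is not the paper's network. In this sketch each correction Gamma_ell is the cell average of the current residual on the uniform grid of width N^{-ell}, and the readout Phi_ell is the running sum of corrections. The sampling grid, the averaging rule and all names are assumptions made for illustration.

```python
from typing import Callable, List


def multigrade_readouts(f: Callable[[float], float], N: int, L: int,
                        samples_per_cell: int = 8) -> List[List[float]]:
    """Return sampled readouts Phi_0..Phi_L on a midpoint grid of [0, 1].

    Stand-in construction (an assumption, not the paper's network):
    Gamma_ell is the cell average of the residual f_ell on cells of width
    N**(-ell); Phi_ell = Phi_{ell-1} + Gamma_ell; f_{ell+1} = f_ell - Gamma_ell.
    """
    n = samples_per_cell * N ** L              # total number of sample points
    xs = [(i + 0.5) / n for i in range(n)]
    residual = [f(x) for x in xs]              # f_0 := f
    phi = [0.0] * n
    readouts = []
    for ell in range(L + 1):
        cell = n // N ** ell                   # samples per level-ell cell
        gamma = [0.0] * n
        for start in range(0, n, cell):
            mean = sum(residual[start:start + cell]) / cell
            gamma[start:start + cell] = [mean] * cell
        phi = [p + g for p, g in zip(phi, gamma)]            # earlier terms kept
        residual = [r - g for r, g in zip(residual, gamma)]  # next residual
        readouts.append(phi)
    return readouts


def sup_error(f: Callable[[float], float], readout: List[float]) -> float:
    """Largest sampled deviation between f and a readout."""
    n = len(readout)
    return max(abs(f((i + 0.5) / n) - v) for i, v in enumerate(readout))


if __name__ == "__main__":
    f = lambda x: abs(x - 0.5)
    for ell, phi in enumerate(multigrade_readouts(f, N=2, L=4)):
        print(ell, sup_error(f, phi))
```

Earlier corrections are never recomputed: each pass only adds a term fitted to what the previous readout left behind.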

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training algorithms could monitor layer-wise error reduction on a validation set to decide when to stop deepening the network.
  • The nested readout structure suggests a possible link between deep networks and classical multi-resolution methods such as wavelets.
  • Similar geometric layer-wise bounds may hold for other activation families or for approximation in different norms once the existence of the shared architecture is verified.
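
The first bullet could be turned into a simple stopping rule. This is a hypothetical sketch: `choose_depth`, its `rel_tol` threshold and the example error values are invented for illustration and are not part of the paper or of any library.

```python
from typing import Sequence


def choose_depth(val_errors: Sequence[float], rel_tol: float = 0.1) -> int:
    """Return the smallest depth ell after which deepening stops paying off.

    val_errors[ell] is the validation error of the layer-ell readout.
    Deepening stops at the first ell whose successor improves the error
    by less than the fraction rel_tol.
    """
    for ell in range(len(val_errors) - 1):
        current, nxt = val_errors[ell], val_errors[ell + 1]
        if current <= 0.0 or (current - nxt) / current < rel_tol:
            return ell
    return len(val_errors) - 1


if __name__ == "__main__":
    # Made-up errors: geometric decay that flattens at a noise floor.
    errors = [1.0, 0.5, 0.25, 0.24, 0.24]
    print(choose_depth(errors))
```

A relative threshold is used so that the rule behaves the same way regardless of the overall scale of the error.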

Load-bearing premise

A single shared mixed-activation architecture of fixed width 2dN+d+2 exists for any prescribed finite depth such that each intermediate readout satisfies the stated modulus-of-continuity error bound.

What would settle it

For d=1, N=2 and the 1-Lipschitz function f(x)=|x-1/2| on [0,1], build the network to depth ell=3 and check whether the L^p approximation error after three layers, for a finite p such as p=2, exceeds 3 times 2^{-3} = 0.375.
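
A harness for this kind of check might look as follows. It estimates an L^p error on [0,1] by midpoint quadrature and compares it with (2d+1)N^{-ell}. Because the paper's network is not reproduced here, `phi_standin` is a piecewise-constant placeholder and not the paper's Phi_ell. The target |x - 1/2| is chosen as a 1-Lipschitz example, p = 2 is an arbitrary finite exponent, and all function names are hypothetical.

```python
import math
from typing import Callable


def lp_error(f: Callable[[float], float], g: Callable[[float], float],
             p: float = 2.0, n: int = 4096) -> float:
    """Midpoint-rule estimate of the L^p([0,1]) norm of f - g."""
    total = sum(abs(f((i + 0.5) / n) - g((i + 0.5) / n)) ** p for i in range(n))
    return (total / n) ** (1.0 / p)


def threshold(d: int, N: int, ell: int) -> float:
    """(2d+1) * N**(-ell), the rate quoted for 1-Lipschitz targets."""
    return (2 * d + 1) * float(N) ** (-ell)


def phi_standin(f: Callable[[float], float], N: int, ell: int) -> Callable[[float], float]:
    """Placeholder readout: f sampled at the midpoint of each level-ell cell."""
    cells = N ** ell

    def phi(x: float) -> float:
        j = min(int(math.floor(x * cells)), cells - 1)
        return f((j + 0.5) / cells)

    return phi


if __name__ == "__main__":
    f = lambda x: abs(x - 0.5)
    d, N, ell = 1, 2, 3
    err = lp_error(f, phi_standin(f, N, ell))
    print(err, threshold(d, N, ell), err <= threshold(d, N, ell))
```

To test the actual claim, `phi_standin` would be swapped for an implementation of the paper's layer-3 readout; a measured error above the threshold would contradict the stated bound.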

Figures

Figures reproduced from arXiv: 2604.20219 by Shijun Zhang, Yuesheng Xu, Zuowei Shen.

Figure 1: Approximation behavior in standard deep neural networks, where a quantitative error …
Figure 2: Each block T_ℓ = ϱ ∘ A_ℓ represents one hidden layer, while g_ℓ = A_ℓ^out is an affine output head. The layer-ℓ readout Φ_ℓ is required to approximate the target function with its own explicit error bound. To illustrate our main results, we first consider the approximation of a 1-Lipschitz target function f on [0, 1]^d, although the principal results in Section 2 are formulated in the more general setting of L …
Figure 3: Multilevel interpretation of Theorem 2.1. For each ℓ ∈ {0, 1, …, L}, the readout Φ_ℓ approximates f with ε_ℓ ≤ (2d + 1) ω_{f,p}(N^{-ℓ}). For convenience, we index the layers from 0 and regard ϱ ∘ A_{-1} as an initialization layer rather than a true hidden layer. Theorem 2.1 establishes a genuine multilevel approximation principle within a fixed-width architecture. If one writes Φ_ℓ = Φ_{ℓ-1} + Γ_ℓ, where Γ_ℓ denote…
Figure 4: Interior cubes Q_{ℓ,β} and transition regions Ω_ℓ for N = 2 at the levels ℓ = 0, 1, 2. The approximation itself is built recursively. Starting from f_0 := f, we construct refinement modules Γ_ℓ and residuals f_{ℓ+1} through Φ_ℓ := Σ_{j=0}^{ℓ} Γ_j, f_{ℓ+1} := f_ℓ − Γ_ℓ for ℓ = 0, 1, …, L. The purpose of the layer-ℓ module Γ_ℓ is to capture the behavior of the current residual at scale N^{-ℓ}. We also note that Φ_ℓ is consta…
Figure 5: The step-proxy function h for N = 4. Define the two-variable map h_ℓ : ℝ^2 → ℝ^2 by h_ℓ(x, y) := (x, … (1/N^ℓ) h …
Figure 6: A network realization of h_ℓ on [0, ∞)^2. The next proposition provides a recursive encoder for the one-dimensional cell index at the levels ℓ ≥ 1. At the base level ℓ = 0, by contrast, no localization is needed, since there is only one interior cube. Proposition 3.1. For every ℓ ∈ {1, 2, …, L} and every j ∈ {0, 1, …, N^ℓ − 1}, h_ℓ ∘ ⋯ ∘ h_1 (x, 0) = (x, j/N^ℓ) for all x ∈ I_{ℓ,j}. Proof. We arg…
Figure 7: Conceptual overview of the multigrade construction, showing how the encoder and …
Figure 8: A shared network architecture realizing the layer-wise approximants Φ …
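
Proposition 3.1, quoted in the Figure 6 caption, says that composing h_1, ..., h_ℓ sends (x, 0) to (x, j/N^ℓ) when x lies in the cell I_{ℓ,j}. The sketch below mimics that digit-by-digit encoding with an exact floor in place of the paper's step-proxy h, whose formula is not reproduced on this page. The names `encode_step` and `encode` are hypothetical, the cells are assumed to be the intervals [j/N^ℓ, (j+1)/N^ℓ), and exact rationals are used to avoid rounding.

```python
from fractions import Fraction
from math import floor
from typing import Tuple


def encode_step(x: Fraction, y: Fraction, N: int, level: int) -> Tuple[Fraction, Fraction]:
    """One level of the stand-in encoder: append the level-th base-N digit of x to y.

    Assumption: an exact floor replaces the paper's step-proxy function h.
    """
    digit = floor(x * N ** level) % N
    return x, y + Fraction(digit, N ** level)


def encode(x: Fraction, N: int, ell: int) -> Tuple[Fraction, Fraction]:
    """Compose levels 1..ell starting from (x, 0)."""
    state = (x, Fraction(0))
    for level in range(1, ell + 1):
        state = encode_step(state[0], state[1], N, level)
    return state


if __name__ == "__main__":
    x = Fraction(127, 200)          # lies in [5/8, 6/8), so j = 5 for N = 2, ell = 3
    print(encode(x, N=2, ell=3))
```

The first coordinate passes through unchanged, which matches the shape of the identity in the proposition: the input x stays available while the second coordinate accumulates the cell index.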
Original abstract

Depth is widely viewed as a central contributor to the success of deep neural networks, whereas standard neural network approximation theory typically provides guarantees only for the final output and leaves the role of intermediate layers largely unclear. We address this gap by developing a quantitative framework in which depth admits a precise scale-dependent interpretation. Specifically, we design a single shared mixed-activation architecture of fixed width $2dN+d+2$ and any prescribed finite depth such that each intermediate readout $\Phi_\ell$ is itself an approximant to the target function $f$. For $f\in L^p([0,1]^d)$ with $p\in [1,\infty)$, the approximation error of $\Phi_\ell$ is controlled by $(2d+1)$ times the $L^p$ modulus of continuity at the geometric scale $N^{-\ell}$ for all $\ell$. The estimate reduces to the geometric rate $(2d+1)N^{-\ell}$ if $f$ is $1$-Lipschitz. Our network design is inspired by multigrade deep learning, where depth serves as a progressive refinement mechanism: each new correction targets residual information at a finer scale while the earlier correction terms remain part of the later readouts, yielding a nested architecture that supports adaptive refinement without redesigning the preceding network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper develops a quantitative framework for interpreting depth in neural networks via a single shared mixed-activation architecture of fixed width 2dN + d + 2 and arbitrary finite depth. Each intermediate readout Φ_ℓ approximates f ∈ L^p([0,1]^d) with error ||Φ_ℓ - f||_p ≤ (2d+1) ω_p(f, N^{-ℓ}), where ω_p is the L^p modulus of continuity at geometric scale N^{-ℓ}; the bound simplifies to the geometric rate (2d+1)N^{-ℓ} when f is 1-Lipschitz. The nested design allows progressive refinement at finer scales while retaining earlier corrections.

Significance. If the explicit construction and bounds are rigorously established, the result is significant for providing the first layer-wise, scale-dependent approximation guarantees in deep network theory, moving beyond final-output-only bounds. The fixed-width shared architecture and parameter-free geometric rates (depending only on d, N, ℓ) are notable strengths that align with multiscale approximation ideas and could inform adaptive network design.

minor comments (4)
  1. The abstract asserts the architecture and bounds without derivation details; the main text must supply the explicit construction of the mixed-activation network (including the specific activations and how the width 2dN+d+2 is achieved) and the full proof of the error bound to permit verification.
  2. Define the L^p modulus of continuity ω_p(f, δ) explicitly in the preliminaries section, including its precise mathematical expression.
  3. Clarify in the main text how the nested readouts are formed (e.g., which layers are read out at each ℓ) and confirm that no error accumulation occurs across depths.
  4. Add a brief comparison in the introduction or related-work section to prior approximation results for deep networks to better highlight the novelty of the intermediate-readout guarantees.
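
For reference on comment 2, one standard form of the L^p modulus of continuity on the unit cube is sketched below. The paper's own definition may differ in details, such as the norm on h or how translates that leave the domain are handled, so this is an assumption rather than a quotation.

```latex
\omega_p(f,\delta) \;:=\; \sup_{\substack{h\in\mathbb{R}^d \\ \|h\|\le\delta}}
  \Bigl(\int_{\{x\in[0,1]^d:\; x+h\in[0,1]^d\}} |f(x+h)-f(x)|^p \,\mathrm{d}x\Bigr)^{1/p},
  \qquad \delta\ge 0 .
```

With this form, a function that is 1-Lipschitz with respect to the same norm satisfies ω_p(f, δ) ≤ δ, because the integrand is at most δ^p on a set of measure at most 1. That is how the bound (2d+1) ω_p(f, N^{-ℓ}) reduces to (2d+1) N^{-ℓ}.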

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and assessment of the significance of our layer-wise approximation framework. The recommendation for minor revision is noted. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No circularity: constructive multiscale network with independent error bounds

full rationale

The paper presents an explicit construction of a fixed-width mixed-activation network whose intermediate readouts Φ_ℓ satisfy an error bound controlled by the L^p modulus of continuity ω_p(f, N^{-ℓ}). This bound follows directly from the definition of the modulus of continuity once the network is built to realize successive corrections at geometric scales; the Lipschitz case is an immediate specialization. No parameter is fitted to data and then relabeled as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and the derivation does not reduce any claimed result to its own inputs by definition. The architecture is self-contained against external benchmarks (standard modulus-of-continuity estimates) and does not rely on renaming known empirical patterns or smuggling ansatzes via prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the existence of the described architecture and on standard properties of the L^p modulus of continuity; no explicit free parameters beyond the input dimension d and scale parameter N are introduced in the abstract, and no new entities with independent evidence are postulated.

free parameters (1)
  • N
    Scale parameter that sets the geometric refinement level per layer; chosen to achieve desired approximation precision.
axioms (1)
  • [standard math] Standard properties of the L^p modulus of continuity on [0,1]^d
    The error bound is expressed directly in terms of the modulus of continuity, relying on its known definition and inequalities in L^p spaces.
invented entities (1)
  • mixed-activation architecture with intermediate readouts Φ_ℓ [no independent evidence]
    purpose: To realize nested, progressive refinement where each layer improves the approximation at a finer geometric scale while preserving earlier terms
    New architecture type introduced to achieve the layer-wise guarantees; no independent evidence outside the claimed construction is provided.

pith-pipeline@v0.9.0 · 5537 in / 1536 out tokens · 41215 ms · 2026-05-10T01:22:28.683431+00:00

