Geometric Layer-wise Approximation Rates for Deep Networks

Shijun Zhang; Yuesheng Xu; Zuowei Shen

arXiv: 2604.20219 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.NA · math.NA · stat.ML


Pith reviewed 2026-05-10 01:22 UTC · model grok-4.3


keywords: deep neural networks · approximation rates · modulus of continuity · layer-wise approximation · geometric convergence · mixed activation functions · multigrade learning


The pith

A fixed-width deep network with mixed activations makes every intermediate layer a valid approximant to any L^p function at successively finer scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs one neural network architecture of width 2dN+d+2 that can be extended to any finite depth while ensuring that each partial output after ell layers approximates the target with an error controlled by the L^p modulus of continuity evaluated at scale N^{-ell}. This gives depth a concrete meaning as successive correction of residuals at geometrically smaller scales, without discarding earlier layers. For 1-Lipschitz targets the bound simplifies to the geometric rate (2d+1)N^{-ell}. The result matters because it supplies a quantitative account of why adding layers improves approximation even when width stays fixed, and it supports adaptive refinement by keeping all prior corrections inside later readouts.

Core claim

We design a single shared mixed-activation architecture of fixed width 2dN+d+2 and any prescribed finite depth such that each intermediate readout Phi_ell is itself an approximant to the target function f. For f in L^p([0,1]^d) with p in [1, infinity), the approximation error of Phi_ell is controlled by (2d+1) times the L^p modulus of continuity at the geometric scale N^{-ell} for all ell. The estimate reduces to the geometric rate (2d+1)N^{-ell} if f is 1-Lipschitz.
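
For reference, the bound in the core claim can be written out as below. The definition of the L^p modulus of continuity shown here is the standard textbook one and is an assumption: the paper's own convention (for example, how shifts that leave [0,1]^d are handled, or which norm is placed on h) may differ in detail.

```latex
% Assumed (standard) definition; the paper's exact convention may differ.
\omega_{f,p}(t) := \sup_{\|h\|\le t}
  \Big( \int_{\{x \,:\, x,\;x+h \,\in\, [0,1]^d\}} |f(x+h)-f(x)|^p \,dx \Big)^{1/p},
  \qquad t \ge 0.

% Layer-wise bound from the core claim, for p in [1, \infty):
\|f-\Phi_\ell\|_{L^p([0,1]^d)} \;\le\; (2d+1)\,\omega_{f,p}\big(N^{-\ell}\big).

% If f is 1-Lipschitz with respect to the same norm, then
% |f(x+h)-f(x)| \le \|h\| \le t, and the integration domain has measure at most 1, so
\omega_{f,p}\big(N^{-\ell}\big) \le N^{-\ell}
\quad\Longrightarrow\quad
\|f-\Phi_\ell\|_{L^p([0,1]^d)} \;\le\; (2d+1)\,N^{-\ell}.
```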

What carries the argument

The mixed-activation network with nested intermediate readouts Phi_ell that accumulate corrections at successive geometric scales N^{-ell} while preserving all earlier terms in later outputs.

If this is right

Each added layer refines the approximation at a strictly smaller geometric scale without altering earlier layers.
The same fixed-width network serves as a valid approximant at every depth, enabling depth to act as a continuous refinement parameter.
For Lipschitz functions the error contracts geometrically with depth at a rate independent of the particular function beyond its Lipschitz constant.
The construction supports multigrade learning in which each new layer targets only the residual information left at finer scales.
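
The residual-correction idea behind these consequences can be sketched numerically. The code below is only an illustration under stated assumptions: the refinement module is a hypothetical stand-in (cell averages of the current residual on N^ell equal cells), not the paper's mixed-activation network, and the names `cell_average_module` and `multigrade_readouts` are invented for this sketch.

```python
import numpy as np

def cell_average_module(residual, n_cells):
    """Hypothetical stand-in for a refinement module: replace the residual
    by its mean on each of n_cells equal cells of the sample grid."""
    cells = residual.reshape(n_cells, -1)            # one row per cell
    return np.repeat(cells.mean(axis=1), cells.shape[1])

def multigrade_readouts(f_samples, N, L):
    """Return [Phi_0, ..., Phi_L] built by successive residual correction."""
    readouts = []
    phi = np.zeros_like(f_samples)
    residual = f_samples.copy()                      # f_0 := f
    for ell in range(L + 1):
        gamma = cell_average_module(residual, N ** ell)  # correction at scale N^-ell
        phi = phi + gamma                            # Phi_ell = Phi_{ell-1} + Gamma_ell
        residual = residual - gamma                  # f_{ell+1} = f_ell - Gamma_ell
        readouts.append(phi.copy())
    return readouts

N, L = 2, 3
n_samples = N ** L * 64                              # divisible by every N**ell used
x = (np.arange(n_samples) + 0.5) / n_samples         # midpoints of a fine grid on [0, 1]
f = np.abs(x - 0.5)                                  # a 1-Lipschitz target (assumed example)
readouts = multigrade_readouts(f, N, L)
# Discrete L^1 error of each readout, to compare with (2d+1) * N**(-ell) for d = 1.
l1_errors = [float(np.mean(np.abs(f - phi))) for phi in readouts]
```

Earlier corrections are never recomputed: each pass adds one term to the running sum, which is the nesting property the list above describes.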

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training algorithms could monitor layer-wise error reduction on a validation set to decide when to stop deepening the network.
The nested readout structure suggests a possible link between deep networks and classical multi-resolution methods such as wavelets.
Similar geometric layer-wise bounds may hold for other activation families or for approximation in different norms once the existence of the shared architecture is verified.
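
As an a priori counterpart to monitoring validation error, the 1-Lipschitz bound (2d+1)N^{-ell} can be inverted to give a worst-case depth for a target tolerance. This is a minimal sketch; the function names are hypothetical, and the bound applies only to the paper's construction, not to a trained network.

```python
def lipschitz_bound(d, N, ell):
    """Stated bound (2d+1) * N**(-ell) for a 1-Lipschitz target."""
    return (2 * d + 1) * float(N) ** (-ell)

def network_width(d, N):
    """Fixed width 2dN + d + 2 quoted in the abstract."""
    return 2 * d * N + d + 2

def min_depth_for_tolerance(d, N, tol):
    """Smallest ell >= 0 with (2d+1) * N**(-ell) <= tol."""
    if tol <= 0 or N < 2:
        raise ValueError("need tol > 0 and N >= 2 for the bound to reach tol")
    ell = 0
    while lipschitz_bound(d, N, ell) > tol:
        ell += 1
    return ell
```

For instance, `min_depth_for_tolerance(1, 2, 0.375)` returns 3, and `network_width(1, 2)` gives 7, independent of that depth.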

Load-bearing premise

A single shared mixed-activation architecture of fixed width 2dN+d+2 exists for any prescribed finite depth such that each intermediate readout satisfies the stated modulus-of-continuity error bound.

What would settle it

For d=1, N=2 and the 1-Lipschitz function f(x)=|x-1/2| on [0,1], build the network to depth ell=3 and check whether the L^p approximation error of the readout Phi_3, for some p in [1, infinity), exceeds 3 times 2^{-3} = 0.375.
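
A check of this kind could be set up with a small harness like the one below. It is a sketch under assumptions: the target is taken to be f(x) = |x - 1/2| (which is 1-Lipschitz), the error is measured in L^p for finite p because the bound is stated for p in [1, infinity), and `standin_readout` is a placeholder piecewise-constant approximant, not the paper's Phi_3. A real test would pass the constructed network's readout instead.

```python
import numpy as np

def lp_error(f, approx, p, n_samples=4096):
    """Midpoint-rule estimate of the L^p([0,1]) distance between f and approx."""
    x = (np.arange(n_samples) + 0.5) / n_samples
    return float(np.mean(np.abs(f(x) - approx(x)) ** p) ** (1.0 / p))

def violates_bound(f, approx, d, N, ell, p):
    """True if the estimated error exceeds (2d+1) * N**(-ell)."""
    return lp_error(f, approx, p) > (2 * d + 1) * float(N) ** (-ell)

def target(x):
    return np.abs(x - 0.5)                 # 1-Lipschitz on [0, 1]

def standin_readout(x, N=2, ell=3):
    # Placeholder for Phi_3 (NOT the paper's network):
    # value of the target at the midpoint of the cell of width N**(-ell) containing x.
    cell = np.minimum(np.floor(x * N ** ell), N ** ell - 1)
    return target((cell + 0.5) / N ** ell)

# Evaluate the d = 1, N = 2, ell = 3 case for a few finite exponents p.
results = {p: violates_bound(target, standin_readout, 1, 2, 3, p) for p in (1, 2, 4)}
```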

Figures

Figures reproduced from arXiv: 2604.20219 by Shijun Zhang, Yuesheng Xu, Zuowei Shen.

**Figure 2.** Each block T_ℓ = ϱ ∘ A_ℓ represents one hidden layer, while g_ℓ = A_ℓ^{out} is an affine output head. The layer-ℓ readout Φ_ℓ is required to approximate the target function with its own explicit error bound. To illustrate our main results, we first consider the approximation of a 1-Lipschitz target function f on [0,1]^d, although the principal results in Section 2 are formulated in the more general setting of L …

**Figure 3.** Multilevel interpretation of Theorem 2.1. For each ℓ ∈ {0, 1, …, L}, the readout Φ_ℓ approximates f with ε_ℓ ≤ (2d+1) ω_{f,p}(N^{-ℓ}). For convenience, we index the layers from 0 and regard ϱ ∘ A_{-1} as an initialization layer rather than a true hidden layer. Theorem 2.1 establishes a genuine multilevel approximation principle within a fixed-width architecture. If one writes Φ_ℓ = Φ_{ℓ-1} + Γ_ℓ, where Γ_ℓ denote…

**Figure 4.** Interior cubes Q_{ℓ,β} and transition regions Ω_ℓ for N = 2 at the levels ℓ = 0, 1, 2. The approximation itself is built recursively. Starting from f_0 := f, we construct refinement modules Γ_ℓ and residuals f_{ℓ+1} through Φ_ℓ := Σ_{j=0}^{ℓ} Γ_j, f_{ℓ+1} := f_ℓ − Γ_ℓ for ℓ = 0, 1, …, L. The purpose of the layer-ℓ module Γ_ℓ is to capture the behavior of the current residual at scale N^{-ℓ}. We also note that Φ_ℓ is consta…

**Figure 5.** The step-proxy function h for N = 4. Define the two-variable map h_ℓ : ℝ² → ℝ² by h_ℓ(x, y)^T := …

**Figure 6.** A network realization of h_ℓ on [0, ∞)². The next proposition provides a recursive encoder for the one-dimensional cell index at the levels ℓ ≥ 1. At the base level ℓ = 0, by contrast, no localization is needed, since there is only one interior cube. Proposition 3.1. For every ℓ ∈ {1, 2, …, L} and every j ∈ {0, 1, …, N^ℓ − 1}, h_ℓ ∘ ⋯ ∘ h_1 (x, 0)^T = (x, j/N^ℓ)^T for all x ∈ I_{ℓ,j}. Proof. We arg…

**Figure 7.** Conceptual overview of the multigrade construction, showing how the encoder and …

**Figure 8.** A shared network architecture realizing the layer-wise approximants Φ …
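
The recursive encoder of Proposition 3.1 (quoted in the Figure 6 caption) maps (x, 0) to (x, j/N^ℓ) when x lies in the level-ℓ interval I_{ℓ,j}. The sketch below mimics that behavior with an exact floor in place of the paper's step-proxy function h, so it is a stand-in rather than the network realization. It also assumes that I_{ℓ,j} sits inside [j/N^ℓ, (j+1)/N^ℓ), and it ignores the transition regions where the proxy and the true step may differ. The names `encoder_step` and `encode` are hypothetical.

```python
from fractions import Fraction
import math

def encoder_step(x, y, N, ell):
    """Stand-in for h_ell: keep x, and append the ell-th base-N digit of x to y.

    Assumes y already equals floor(N**(ell-1) * x) / N**(ell-1).
    """
    digit = math.floor((x - y) * N ** ell)    # lies in {0, ..., N-1} under the assumption
    return x, y + Fraction(digit, N ** ell)

def encode(x, N, L):
    """Compose the steps for ell = 1..L starting from (x, 0); return the second slot."""
    y = Fraction(0)
    for ell in range(1, L + 1):
        x, y = encoder_step(x, y, N, ell)
    return y

# Example: x = 5/16 lies in [2/8, 3/8), so the level-3 index is j = 2 and j/N^3 = 1/4.
example = encode(Fraction(5, 16), 2, 3)
```

Exact rational arithmetic via `fractions.Fraction` is used so that points near cell boundaries are not misplaced by floating-point rounding.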

Original abstract

Depth is widely viewed as a central contributor to the success of deep neural networks, whereas standard neural network approximation theory typically provides guarantees only for the final output and leaves the role of intermediate layers largely unclear. We address this gap by developing a quantitative framework in which depth admits a precise scale-dependent interpretation. Specifically, we design a single shared mixed-activation architecture of fixed width $2dN+d+2$ and any prescribed finite depth such that each intermediate readout $\Phi_\ell$ is itself an approximant to the target function $f$. For $f\in L^p([0,1]^d)$ with $p\in [1,\infty)$, the approximation error of $\Phi_\ell$ is controlled by $(2d+1)$ times the $L^p$ modulus of continuity at the geometric scale $N^{-\ell}$ for all $\ell$. The estimate reduces to the geometric rate $(2d+1)N^{-\ell}$ if $f$ is $1$-Lipschitz. Our network design is inspired by multigrade deep learning, where depth serves as a progressive refinement mechanism: each new correction targets residual information at a finer scale while the earlier correction terms remain part of the later readouts, yielding a nested architecture that supports adaptive refinement without redesigning the preceding network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a fixed-width mixed-activation construction where each added layer refines the approximation at scale N^{-ℓ} via nested readouts.


The central point is that the authors build one shared network of width 2dN+d+2 that works for any finite depth, with each intermediate readout Φ_ℓ satisfying an L^p error bound of (2d+1) times the modulus of continuity of f at scale N^{-ℓ}. For 1-Lipschitz f this simplifies to the clean geometric rate (2d+1)N^{-ℓ}. This is new relative to the usual final-output-only guarantees in neural approximation theory.

The nested structure, where later corrections keep earlier terms, is a direct way to make depth correspond to progressive scale refinement without rebuilding the network each time. That matches the multigrade learning intuition they cite and gives a concrete handle on why depth matters quantitatively. The construction appears consistent at the level of the high-level description; there is no visible accumulation of errors or hidden depth dependence in the stated bounds.

The main soft spot is that the abstract only asserts existence and the bound, so the actual proof and the precise definition of the mixed activations need to be checked to confirm that the width stays independent of depth and that the intermediate readouts really achieve the claimed rate without extra constants. If the full paper supplies a clean inductive construction, this holds up.

The work is aimed at people working on approximation theory for deep networks and on multiscale or hierarchical designs. It is worth sending to peer review because it fills a recognized gap with a reproducible quantitative statement, even if the details require referee scrutiny.

Referee Report

0 major / 4 minor

Summary. The paper develops a quantitative framework for interpreting depth in neural networks via a single shared mixed-activation architecture of fixed width 2dN + d + 2 and arbitrary finite depth. Each intermediate readout Φ_ℓ approximates f ∈ L^p([0,1]^d) with error ||Φ_ℓ - f||_p ≤ (2d+1) ω_p(f, N^{-ℓ}), where ω_p is the L^p modulus of continuity at geometric scale N^{-ℓ}; the bound simplifies to the geometric rate (2d+1)N^{-ℓ} when f is 1-Lipschitz. The nested design allows progressive refinement at finer scales while retaining earlier corrections.

Significance. If the explicit construction and bounds are rigorously established, the result is significant for providing the first layer-wise, scale-dependent approximation guarantees in deep network theory, moving beyond final-output-only bounds. The fixed-width shared architecture and parameter-free geometric rates (depending only on d, N, ℓ) are notable strengths that align with multiscale approximation ideas and could inform adaptive network design.

minor comments (4)

The abstract asserts the architecture and bounds without derivation details; the main text must supply the explicit construction of the mixed-activation network (including the specific activations and how the width 2dN+d+2 is achieved) and the full proof of the error bound to permit verification.
Define the L^p modulus of continuity ω_p(f, δ) explicitly in the preliminaries section, including its precise mathematical expression.
Clarify in the main text how the nested readouts are formed (e.g., which layers are read out at each ℓ) and confirm that no error accumulation occurs across depths.
Add a brief comparison in the introduction or related-work section to prior approximation results for deep networks to better highlight the novelty of the intermediate-readout guarantees.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and assessment of the significance of our layer-wise approximation framework. The recommendation for minor revision is noted. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No circularity: constructive multiscale network with independent error bounds


The paper presents an explicit construction of a fixed-width mixed-activation network whose intermediate readouts Φ_ℓ satisfy an error bound controlled by the L^p modulus of continuity ω_p(f, N^{-ℓ}). This bound follows directly from the definition of the modulus of continuity once the network is built to realize successive corrections at geometric scales; the Lipschitz case is an immediate specialization. No parameter is fitted to data and then relabeled as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and the derivation does not reduce any claimed result to its own inputs by definition. The architecture is self-contained against external benchmarks (standard modulus-of-continuity estimates) and does not rely on renaming known empirical patterns or smuggling ansatzes via prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the existence of the described architecture and on standard properties of the L^p modulus of continuity; no explicit free parameters beyond the input dimension d and scale parameter N are introduced in the abstract, and no new entities with independent evidence are postulated.

free parameters (1)

N
Scale parameter that sets the geometric refinement level per layer; chosen to achieve desired approximation precision.

axioms (1)

Standard properties of the L^p modulus of continuity on [0,1]^d (standard math)
The error bound is expressed directly in terms of the modulus of continuity, relying on its known definition and inequalities in L^p spaces.

invented entities (1)

mixed-activation architecture with intermediate readouts Φ_ℓ (no independent evidence)
purpose: To realize nested, progressive refinement where each layer improves the approximation at a finer geometric scale while preserving earlier terms
New architecture type introduced to achieve the layer-wise guarantees; no independent evidence outside the claimed construction is provided.



Reference graph

Works this paper leans on

58 extracted references · 38 canonical work pages

[1]

Understanding gradient descent on the edge of stability in deep learning

Sanjeev Arora, Zhiyuan Li, and Abhishek Panigrahi. Understanding gradient descent on the edge of stability in deep learning. InInternational Conference on Machine Learning, pages 948–1024. PMLR, 2022

2022
[2]

Greedy layer-wise training of deep networks

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. InProceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’06, page 153–160, Cambridge, MA, USA, 2006. MIT Press

2006
[3]

Optimal approx- imation with sparsely connected deep neural networks.SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019

Helmut B¨ olcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approx- imation with sparsely connected deep neural networks.SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019. DOI: 10.1137/18M118709X

work page doi:10.1137/18m118709x 2019
[4]

A phase shift deep neural network for high frequency approximation and wave problems.SIAM Journal on Scientific Computing, 42(5):A3285– A3312, 2020

Wei Cai, Xiaoguang Li, and Lizuo Liu. A phase shift deep neural network for high frequency approximation and wave problems.SIAM Journal on Scientific Computing, 42(5):A3285– A3312, 2020. DOI: 10.1137/19M1310050

work page doi:10.1137/19m1310050 2020
[5]

Interpolation, approximation, and controllability of deep neural networks.SIAM Journal on Control and Optimization, 63 (1):625–649, 2025

Jingpu Cheng, Qianxiao Li, Ting Lin, and Zuowei Shen. Interpolation, approximation, and controllability of deep neural networks.SIAM Journal on Control and Optimization, 63 (1):625–649, 2025. DOI: 10.1137/23M1599744

work page doi:10.1137/23m1599744 2025
[6]

Deep learning and the rate of approximation by flows.arXiv e-prints, art

Jingpu Cheng, Qianxiao Li, Ting Lin, and Zuowei Shen. Deep learning and the rate of approximation by flows.arXiv e-prints, art. arXiv:2603.15363, March 2026. DOI: 10.48550/arXiv.2603.15363

work page doi:10.48550/arxiv.2603.15363 2026
[7]

Optimal stable nonlinear approximation.Foundations of Computational Mathematics, 22:607–648,

Albert Cohen, Ronald DeVore, Guergana Petrova, and Przemyslaw Wojtaszczyk. Optimal stable nonlinear approximation.Foundations of Computational Mathematics, 22:607–648,
[8]

DOI: 10.1007/s10208-021-09494-z

work page doi:10.1007/s10208-021-09494-z
[9]

Zico Kolter, and Ameet Talwalkar

Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gra- dient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065, 2021

work page arXiv 2021
[10]

Approximation by superpositions of a sigmoidal function.Mathematics of Control, Signals and Systems, 2:303–314, 1989

George Cybenko. Approximation by superpositions of a sigmoidal function.Mathematics of Control, Signals, and Systems, 2:303–314, 1989. DOI: 10.1007/BF02551274

work page doi:10.1007/bf02551274 1989
[11]

Nonlinear approximation and (deep) ReLU networks.Constructive Approximation, 55: 127–172, 2022

Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova. Nonlinear approximation and (deep) ReLU networks.Constructive Approximation, 55: 127–172, 2022. DOI: 10.1007/s00365-021-09548-z

work page doi:10.1007/s00365-021-09548-z 2022
[12]

Ronald A. DeVore. Nonlinear approximation.Acta Numerica, 7:51–150, 1998. DOI: 10.1017/S0962492900002816

work page doi:10.1017/s0962492900002816 1998
[13]

Accurate interpolation for scattered data through hierarchical residual refinement

Shizhe Ding, Boyang Xia, and Dongbo Bu. Accurate interpolation for scattered data through hierarchical residual refinement. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 9144–9155. Curran Associates, Inc., 2023. URL:https://proceedings. neurips.cc/paper_f...

2023
[14]

xAFCL: Run Scalable Function Choreographies Across Multiple FaaS Systems.IEEE Transactions on Services Computing.2021:1–1

Feng-Lei Fan, Dayang Wang, Hengtao Guo, Qikui Zhu, Pingkun Yan, Ge Wang, and Hengy- ong Yu. On a sparse shortcut topology of artificial neural networks.IEEE Transactions on Artificial Intelligence, 3(4):595–608, 2022. DOI: 10.1109/TAI.2021.3128132

work page doi:10.1109/tai.2021.3128132 2022
[15]

Gonzalez, Clark Barrett, and Ying Sheng

Ronglong Fang and Yuesheng Xu. Addressing spectral bias of deep neural networks by multi-grade deep learning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 114122–114146. Curran Associates, Inc., 2024. DOI: 10.52202/079017- 3625

work page doi:10.52202/079017- 2024
[16]

Computational advantages of multi-grade deep learning: Convergence analysis and performance insights.arXiv e-prints, art

Ronglong Fang and Yuesheng Xu. Computational advantages of multi-grade deep learning: Convergence analysis and performance insights.arXiv e-prints, art. arXiv:2507.20351, July

work page arXiv
[17]

DOI: 10.48550/arXiv.2507.20351

work page doi:10.48550/arxiv.2507.20351
[18]

Deep residual learning for image recognition,

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Los Alamitos, CA, USA, June 2016. IEEE Computer Society. DOI: 10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[19]

Neural Computation , month =

Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algo- rithm for deep belief nets.Neural Computation, 18(7):1527–1554, 2006. DOI: 10.1162/neco.2006.18.7.1527

work page doi:10.1162/neco.2006.18.7.1527 2006
[20]

Approximation capabilities of multilayer feedforward networks,

Kurt Hornik. Approximation capabilities of multilayer feedforward networks.Neural Net- works, 4(2):251–257, 1991. ISSN 0893-6080. DOI: 10.1016/0893-6080(91)90009-T

work page doi:10.1016/0893-6080(91)90009-t 1991
[21]

1989 , issn =

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators.Neural Networks, 2(5):359–366, 1989. ISSN 0893-6080. DOI: 10.1016/0893-6080(89)90020-8

work page doi:10.1016/0893-6080(89)90020-8 1989
[22]

Adaptive multi-grade deep learning for highly oscillatory Fredholm integral equations of the second kind.Journal of Scientific Computing, 106:64,

Jie Jiang and Yuesheng Xu. Adaptive multi-grade deep learning for highly oscillatory Fredholm integral equations of the second kind.Journal of Scientific Computing, 106:64,
[23]

DOI: 10.1007/s10915-026-03189-9

work page doi:10.1007/s10915-026-03189-9
[24]

Yuling Jiao, Yanming Lai, Xiliang Lu, Fengru Wang, Jerry Zhijian Yang, and Yuanyuan Yang. Deep neural networks with ReLU-Sine-Exponential activations break curse of di- mensionality in approximation on H¨ older class.SIAM Journal on Mathematical Analysis, 55(4):3635–3649, 2023. DOI: 10.1137/21M144431X

work page doi:10.1137/21m144431x 2023
[25]

Deep learning via dynamical systems: An ap- proximation perspective.Journal of the European Mathematical Society, 25(5):1671–1709,

Qianxiao Li, Ting Lin, and Zuowei Shen. Deep learning via dynamical systems: An ap- proximation perspective.Journal of the European Mathematical Society, 25(5):1671–1709,
[26]

DOI: 10.4171/JEMS/1221

work page doi:10.4171/jems/1221
[27]

ResNet with one-neuron hidden layers is a universal approximator

Hongzhou Lin and Stefanie Jegelka. ResNet with one-neuron hidden layers is a universal approximator. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL:https://proceedings.neurips.cc/paper/2018/fi le/03bfc1d4783966c69...

2018
[28]

Deep network approximation for smooth functions.SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021

Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions.SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021. DOI: 10.1137/20M134695X

work page doi:10.1137/20m134695x 2021
[29]

Theory of the frequency principle for general deep neural networks.CSIAM Transactions on Applied Mathematics, 2(3):484– 507, 2021

Tao Luo, Zheng Ma, Zhi-Qin John Xu, and Yaoyu Zhang. Theory of the frequency principle for general deep neural networks.CSIAM Transactions on Applied Mathematics, 2(3):484– 507, 2021. ISSN 2708-0579. DOI: 10.4208/csiam-am.SO-2020-0005

work page doi:10.4208/csiam-am.so-2020-0005 2021
[30]

Mallat.A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way

St´ ephane G. Mallat.A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, Inc., Orlando, FL, USA, 3rd edition, 2008. ISBN 0123743702, 9780123743701. 28

2008
[31]

N-beats: Neural basis expansion analysis for time series forecasting

Boris Oreshkin, Dmytro Carpov, Nicolas Chapados, and Yoshua Bengio. N-beats: Neural basis expansion analysis for time series forecasting. InInternational Conference on Learning Representations (ICLR), 2020

2020
[32]

On the spectral bias of neural networks

Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Ham- precht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International conference on machine learning, pages 5301–5310. PMLR, 2019

2019
[33]

Affine systems inL 2(Rd): The analysis of the analysis operator.Journal of Functional Analysis, 148(2):408–447, 1997

Amos Ron and Zuowei Shen. Affine systems inL 2(Rd): The analysis of the analysis operator.Journal of Functional Analysis, 148(2):408–447, 1997. ISSN 0022-1236. DOI: 10.1006/jfan.1996.3079

work page doi:10.1006/jfan.1996.3079 1997
[34]

Nonlinear approximation via compositions

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, 2019. ISSN 0893-6080. DOI: 10.1016/j.neunet.2019.07.011

work page doi:10.1016/j.neunet.2019.07.011 2019
[35]

Deep network approximation characterized by number of neurons.Communications in Computational Physics, 28(5):1768–1811, 2020

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons.Communications in Computational Physics, 28(5):1768–1811, 2020. ISSN 1991-7120. DOI: 10.4208/cicp.OA-2020-0149

work page doi:10.4208/cicp.oa-2020-0149 2020
[36]

URLhttps://doi.org/10.1162/neco_a_01178

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network with approximation error being reciprocal of width to power of square root of depth.Neural Computation, 33(4): 1005–1036, 03 2021. ISSN 0899-7667. DOI: 10.1162/neco a 01364

work page doi:10.1162/neco 2021
[37]

Neural network approximation: Three hidden layers are enough.Neural Networks, 141:160–173, 2021

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Neural network approximation: Three hidden layers are enough.Neural Networks, 141:160–173, 2021. ISSN 0893-6080. DOI: 10.1016/j.neunet.2021.04.011

work page doi:10.1016/j.neunet.2021.04.011 2021
[38]

Deep network approximation: Achieving arbitrary accuracy with fixed number of neurons.Journal of Machine Learning Research, 23(276):1–60, 2022

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation: Achieving arbitrary accuracy with fixed number of neurons.Journal of Machine Learning Research, 23(276):1–60, 2022. URL:http://jmlr.org/papers/v23/21-1404.html

2022
[39]

Deep network approximation in terms of intrinsic parameters

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation in terms of intrinsic parameters. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 19909– 19934. PM...

2022
[40]

Neural network architecture beyond width and depth

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Neural network architecture beyond width and depth. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 5669–5681. Curran Associates, Inc., 2022. URL:https://proceedings.neurips.cc/paper_files/p aper/2022/hash/257be12f...

2022
[41]

Optimal approximation rate of ReLU networks in terms of width and depth.Journal de Math´ ematiques Pures et Appliqu´ ees, 157:101–135, 2022

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Optimal approximation rate of ReLU networks in terms of width and depth.Journal de Math´ ematiques Pures et Appliqu´ ees, 157:101–135, 2022. ISSN 0021-7824. DOI: 10.1016/j.matpur.2021.07.009

work page doi:10.1016/j.matpur.2021.07.009 2022
[42]

Siegel and Jinchao Xu

Jonathan W. Siegel and Jinchao Xu. High-order approximation rates for shallow neural net- works with cosine and ReLU k activation functions.Applied and Computational Harmonic Analysis, 58:1–26, 2022. ISSN 1063-5203. DOI: 10.1016/j.acha.2021.12.005

work page doi:10.1016/j.acha.2021.12.005 2022
[43]

Implicit neural representations with periodic activation functions

Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wet- zstein. Implicit neural representations with periodic activation functions. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Informa- tion Processing Systems, volume 33, pages 7462–7473. Curran Associates, Inc., 2020. URL: https:...

2020
[44]

Fourier features let networks learn high frequency functions in low dimensional domains,

Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processin...

work page arXiv 2020
[45]

Don’t fear peculiar activation functions: EUAF and beyond.Neural Networks, 186:107258, 2025

Qianchao Wang, Shijun Zhang, Dong Zeng, Zhaoheng Xie, Hengtao Guo, Tieyong Zeng, and Feng-Lei Fan. Don’t fear peculiar activation functions: EUAF and beyond.Neural Networks, 186:107258, 2025. ISSN 0893-6080. DOI: 10.1016/j.neunet.2025.107258

work page doi:10.1016/j.neunet.2025.107258 2025
[46]

Multi-grade deep learning.Communications on Applied Mathematics and Computation, 8(2):778–829, 2026

Yuesheng Xu. Multi-grade deep learning.Communications on Applied Mathematics and Computation, 8(2):778–829, 2026. ISSN 2661-8893. DOI: 10.1007/s42967-024-00474-y

work page doi:10.1007/s42967-024-00474-y 2026
[47]

Multi-grade deep learning for partial differential equations with applications to the burgers equation.arXiv e-prints, art

Yuesheng Xu and Taishan Zeng. Multi-grade deep learning for partial differential equations with applications to the burgers equation.arXiv e-prints, art. arXiv:2309.07401, September

work page arXiv
[48]

DOI: 10.48550/arXiv.2309.07401

work page doi:10.48550/arxiv.2309.07401
[49]

Training behavior of deep neural network in frequency domain

Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. InNeural Information Processing: 26th International Con- ference, ICONIP 2019, Sydney, NSW, Australia, December 12–15, 2019, Proceedings, Part I 26, pages 264–274. Springer, 2019

2019
[50]

Approximation in shift-invariant spaces with deep ReLU neural networks.Neural Networks, 153:269–281, 2022

Yunfei Yang, Zhen Li, and Yang Wang. Approximation in shift-invariant spaces with deep ReLU neural networks.Neural Networks, 153:269–281, 2022. ISSN 0893-6080. DOI: 10.1016/j.neunet.2022.06.013

work page doi:10.1016/j.neunet.2022.06.013 2022
[51]

Error Bounds for Approximations with Deep ReLU Networks,

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks.Neural Networks, 94:103–114, 2017. ISSN 0893-6080. DOI: 10.1016/j.neunet.2017.07.002

work page doi:10.1016/j.neunet.2017.07.002 2017
[52]

Optimal approximation of continuous functions by very deep ReLU net- works

Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU net- works. In S´ ebastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors,Proceedings of the 31st Conference On Learning Theory, volume 75 ofProceedings of Machine Learning Research, pages 639–649. PMLR, 06–09 Jul 2018. URL:http://proceedings.mlr.pres s/v75/yarotsky18a.html

2018
[53]

Elementary superexpressive activations

Dmitry Yarotsky. Elementary superexpressive activations. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 11932–11940. PMLR, 18– 24 Jul 2021. URL:https://proceedings.mlr.press/v139/yarotsky21a.html

2021
[54]

The phase diagram of approximation rates for deep neural networks

Dmitry Yarotsky and Anton Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 13005–13015. Curran Associates, Inc., 2020. URL:https://proceedings.neurips.cc/p aper/2020/file/979a3f14bae...

2020
[55]

Shijun Zhang, Jianfeng Lu, and Hongkai Zhao. Deep network approximation: Beyond ReLU to diverse activation functions. Journal of Machine Learning Research, 25(35):1–39. URL: http://jmlr.org/papers/v25/23-0912.html

[57]

Shijun Zhang, Hongkai Zhao, Yimin Zhong, and Haomin Zhou. Fourier multi-component and multi-layer neural networks: Unlocking high-frequency potential. arXiv e-prints, art. arXiv:2502.18959, February 2025. DOI: 10.48550/arXiv.2502.18959

[58]

Shijun Zhang, Zuowei Shen, and Yuesheng Xu. Multigrade neural network approximation. arXiv e-prints, art. arXiv:2601.16884, January 2026. DOI: 10.48550/arXiv.2601.16884.
