Geometric Layer-wise Approximation Rates for Deep Networks
Pith reviewed 2026-05-10 01:22 UTC · model grok-4.3
The pith
A fixed-width deep network with mixed activations makes every intermediate layer a valid approximant to any L^p function at successively finer scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We design a single shared mixed-activation architecture of fixed width 2dN+d+2 and any prescribed finite depth such that each intermediate readout Phi_ell is itself an approximant to the target function f. For f in L^p([0,1]^d) with p in [1, infinity), the approximation error of Phi_ell is controlled by (2d+1) times the L^p modulus of continuity at the geometric scale N^{-ell} for all ell. The estimate reduces to the geometric rate (2d+1)N^{-ell} if f is 1-Lipschitz.
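Written out, the claim and a small instantiation look as follows. The symbol \omega_p for the L^p modulus of continuity follows the notation of the referee summary further down. The numerical values in the comments are plain arithmetic from the stated formulas, not results taken from the paper.

```latex
\[
  \|\Phi_\ell - f\|_{L^p([0,1]^d)}
  \;\le\; (2d+1)\,\omega_p\!\left(f, N^{-\ell}\right)
  \quad \text{for all } \ell, \qquad p \in [1,\infty).
\]
% Instantiation with d = 1, N = 2:
%   width                      = 2dN + d + 2 = 4 + 1 + 2 = 7
%   1-Lipschitz bound, \ell = 3: (2d+1) N^{-\ell} = 3 \cdot 2^{-3} = 0.375
% Instantiation with d = 2, N = 3:
%   width                      = 2dN + d + 2 = 12 + 2 + 2 = 16
%   1-Lipschitz bound, \ell = 2: (2d+1) N^{-\ell} = 5 \cdot 3^{-2} = 5/9
```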
What carries the argument
The mixed-activation network with nested intermediate readouts Phi_ell that accumulate corrections at successive geometric scales N^{-ell} while preserving all earlier terms in later outputs.
If this is right
- Each added layer refines the approximation at a strictly smaller geometric scale without altering earlier layers.
- The same fixed-width network serves as a valid approximant at every depth, so depth acts as a progressive refinement parameter.
- For Lipschitz functions the error contracts geometrically with depth at a rate independent of the particular function beyond its Lipschitz constant.
- The construction supports multigrade learning in which each new layer targets only the residual information left at finer scales.
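The nesting described in these bullets can be illustrated with a classical multiresolution stand-in. The sketch below is not the paper's mixed-activation network: it assumes d = 1, uses piecewise-constant cell averages on grids of width N^{-ell}, and defines each readout as the previous readout plus a correction fitted to the current residual. The names `cell_average`, `nested_readouts`, `l1_error` and the quadrature resolutions are hypothetical choices made for illustration.

```python
def cell_average(g, n_cells, samples_per_cell=64):
    """Return a step function: the mean of g on each of n_cells equal cells of [0, 1]."""
    h = 1.0 / n_cells
    means = []
    for k in range(n_cells):
        # Midpoint-rule quadrature inside cell k.
        pts = [(k + (j + 0.5) / samples_per_cell) * h for j in range(samples_per_cell)]
        means.append(sum(g(x) for x in pts) / samples_per_cell)

    def step(x):
        k = min(int(x * n_cells), n_cells - 1)  # clamp x = 1.0 into the last cell
        return means[k]

    return step


def nested_readouts(f, N, depth):
    """Build readouts Phi_1..Phi_depth with Phi_l = Phi_{l-1} + correction_l."""
    corrections = []

    def readout(x, upto):
        # Sum of the first `upto` corrections; earlier terms are never modified.
        return sum(c(x) for c in corrections[:upto])

    readouts = []
    for level in range(1, depth + 1):
        # The new correction only sees what the previous readout left over.
        residual = lambda x, prev=level - 1: f(x) - readout(x, prev)
        corrections.append(cell_average(residual, N ** level))
        readouts.append(lambda x, upto=level: readout(x, upto))
    return readouts


def l1_error(f, approx, n=4096):
    """Midpoint-rule estimate of the L^1([0, 1]) distance between f and approx."""
    return sum(abs(f((i + 0.5) / n) - approx((i + 0.5) / n)) for i in range(n)) / n


f = lambda x: abs(x - 0.5)  # a 1-Lipschitz test function (illustrative choice)
N, depth = 2, 4
readouts = nested_readouts(f, N, depth)
errors = [l1_error(f, phi) for phi in readouts]
bounds = [3 * N ** (-level) for level in range(1, depth + 1)]  # (2d+1) N^{-l}, d = 1
```

Comparing `errors` with `bounds` shows how a layer-wise check would be organized. It says nothing about whether the paper's construction attains the bound, since the stand-in is a different approximant.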
Where Pith is reading between the lines
- Training algorithms could monitor layer-wise error reduction on a validation set to decide when to stop deepening the network.
- The nested readout structure suggests a possible link between deep networks and classical multi-resolution methods such as wavelets.
- Similar geometric layer-wise bounds may hold for other activation families or for approximation in different norms once the existence of the shared architecture is verified.
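The first suggestion above, stopping once deeper readouts no longer help, can be sketched as a simple rule on a list of per-layer validation errors. This is a hypothetical illustration: the function name `choose_depth`, the relative-improvement threshold `min_gain`, and the sample error values are assumptions, not something the paper or Pith specifies.

```python
def choose_depth(val_errors, min_gain=0.1):
    """Return the 1-based readout index after which deepening stops paying off.

    val_errors[l - 1] is the validation error of readout Phi_l.
    Deepening continues while each new layer cuts the error by at least
    the fraction `min_gain` relative to the previous readout.
    """
    if not val_errors:
        raise ValueError("need at least one readout error")
    depth = 1
    for prev, curr in zip(val_errors, val_errors[1:]):
        if prev <= 0 or (prev - curr) / prev < min_gain:
            break
        depth += 1
    return depth


# Made-up errors: geometric decay that stalls at a noise floor.
example_errors = [0.40, 0.20, 0.10, 0.095, 0.094]
chosen = choose_depth(example_errors)
```

A relative threshold is used here so the rule does not depend on the scale of the target function; an absolute tolerance would be an equally reasonable choice.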
Load-bearing premise
A single shared mixed-activation architecture of fixed width 2dN+d+2 exists for any prescribed finite depth such that each intermediate readout satisfies the stated modulus-of-continuity error bound.
What would settle it
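For d=1, N=2 and the 2-Lipschitz function f(x)=|2x-1| on [0,1], build the network to depth ell=3 and check whether the L^p approximation error of the third readout Phi_3 (for some p in [1, infinity), the range the claim covers) exceeds 3 times the L^p modulus of continuity of f at scale 2^{-3}, which is at most 3 times 2 times 2^{-3} = 3/4.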
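The paper's network is not reproduced here, so this check cannot be run as stated. As a dry run of the bookkeeping only, the sketch below evaluates the geometric threshold (2d+1) L N^{-ell} for Lipschitz constants L = 1 and L = 2 (the latter is the Lipschitz constant of f(x)=|2x-1|) and, as a stand-in approximant, the cell averages of f on the 8 cells of width 2^{-3}. The stand-in, the helper names, and the quadrature resolution are assumptions.

```python
def lipschitz_threshold(d, N, ell, lip=1.0):
    """(2d+1) * lip * N^{-ell}: the geometric bound scaled by a Lipschitz constant."""
    return (2 * d + 1) * lip * N ** (-ell)


def cell_average_errors(f, n_cells, samples_per_cell=512):
    """L^1 and sampled sup errors of the cell-average approximant of f on [0, 1]."""
    h = 1.0 / n_cells
    l1, sup = 0.0, 0.0
    for k in range(n_cells):
        # Midpoint-rule sample points inside cell k.
        pts = [(k + (j + 0.5) / samples_per_cell) * h for j in range(samples_per_cell)]
        vals = [f(x) for x in pts]
        mean = sum(vals) / samples_per_cell
        devs = [abs(v - mean) for v in vals]
        l1 += h * sum(devs) / samples_per_cell
        sup = max(sup, max(devs))
    return l1, sup


f = lambda x: abs(2 * x - 1)
d, N, ell = 1, 2, 3
l1_err, sup_err = cell_average_errors(f, N ** ell)
threshold_unit = lipschitz_threshold(d, N, ell)         # 3 * 2^{-3} = 0.375
threshold_f = lipschitz_threshold(d, N, ell, lip=2.0)   # 3 * 2 * 2^{-3} = 0.75
print(l1_err, sup_err, threshold_unit, threshold_f)
```

Replacing `cell_average_errors` with an evaluation of the actual readout Phi_3 would turn this into the test proposed above.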
Original abstract
Depth is widely viewed as a central contributor to the success of deep neural networks, whereas standard neural network approximation theory typically provides guarantees only for the final output and leaves the role of intermediate layers largely unclear. We address this gap by developing a quantitative framework in which depth admits a precise scale-dependent interpretation. Specifically, we design a single shared mixed-activation architecture of fixed width $2dN+d+2$ and any prescribed finite depth such that each intermediate readout $\Phi_\ell$ is itself an approximant to the target function $f$. For $f\in L^p([0,1]^d)$ with $p\in [1,\infty)$, the approximation error of $\Phi_\ell$ is controlled by $(2d+1)$ times the $L^p$ modulus of continuity at the geometric scale $N^{-\ell}$ for all $\ell$. The estimate reduces to the geometric rate $(2d+1)N^{-\ell}$ if $f$ is $1$-Lipschitz. Our network design is inspired by multigrade deep learning, where depth serves as a progressive refinement mechanism: each new correction targets residual information at a finer scale while the earlier correction terms remain part of the later readouts, yielding a nested architecture that supports adaptive refinement without redesigning the preceding network.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a quantitative framework for interpreting depth in neural networks via a single shared mixed-activation architecture of fixed width 2dN + d + 2 and arbitrary finite depth. Each intermediate readout Φ_ℓ approximates f ∈ L^p([0,1]^d) with error ||Φ_ℓ - f||_p ≤ (2d+1) ω_p(f, N^{-ℓ}), where ω_p is the L^p modulus of continuity at geometric scale N^{-ℓ}; the bound simplifies to the geometric rate (2d+1)N^{-ℓ} when f is 1-Lipschitz. The nested design allows progressive refinement at finer scales while retaining earlier corrections.
Significance. If the explicit construction and bounds are rigorously established, the result is significant for providing the first layer-wise, scale-dependent approximation guarantees in deep network theory, moving beyond final-output-only bounds. The fixed-width shared architecture and parameter-free geometric rates (depending only on d, N, ℓ) are notable strengths that align with multiscale approximation ideas and could inform adaptive network design.
minor comments (4)
- The abstract asserts the architecture and bounds without derivation details; the main text must supply the explicit construction of the mixed-activation network (including the specific activations and how the width 2dN+d+2 is achieved) and the full proof of the error bound to permit verification.
- Define the L^p modulus of continuity ω_p(f, δ) explicitly in the preliminaries section, including its precise mathematical expression.
- Clarify in the main text how the nested readouts are formed (e.g., which layers are read out at each ℓ) and confirm that no error accumulation occurs across depths.
- Add a brief comparison in the introduction or related-work section to prior approximation results for deep networks to better highlight the novelty of the intermediate-readout guarantees.
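For the second comment, one standard form of the L^p modulus of continuity on the cube is shown below. It is a common textbook convention offered for orientation; the paper's own definition (choice of norm on h, treatment of the boundary) may differ.

```latex
\[
  \omega_p(f,\delta)
  \;=\;
  \sup_{\|h\| \le \delta}
  \left(
    \int_{\{x \in [0,1]^d \,:\, x+h \in [0,1]^d\}}
    |f(x+h) - f(x)|^p \, dx
  \right)^{1/p},
  \qquad \delta \ge 0 .
\]
```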
Simulated Author's Rebuttal
We thank the referee for their positive summary and assessment of the significance of our layer-wise approximation framework. The recommendation for minor revision is noted. No specific major comments were provided in the report.
Circularity Check
No circularity: constructive multiscale network with independent error bounds
full rationale
The paper presents an explicit construction of a fixed-width mixed-activation network whose intermediate readouts Φ_ℓ satisfy an error bound controlled by the L^p modulus of continuity ω_p(f, N^{-ℓ}). This bound follows directly from the definition of the modulus of continuity once the network is built to realize successive corrections at geometric scales; the Lipschitz case is an immediate specialization. No parameter is fitted to data and then relabeled as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and the derivation does not reduce any claimed result to its own inputs by definition. The architecture is self-contained against external benchmarks (standard modulus-of-continuity estimates) and does not rely on renaming known empirical patterns or smuggling ansatzes via prior work.
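The "immediate specialization" can be spelled out in two lines. The derivation assumes a modulus defined through translates f(x+h) with \|h\| \le \delta and a function f that is 1-Lipschitz in the same norm; both are assumptions about the paper's conventions.

```latex
\begin{align*}
  |f(x+h) - f(x)| \le \|h\| \le \delta
  &\;\Longrightarrow\;
  \omega_p(f,\delta) \le \delta \cdot \big|[0,1]^d\big|^{1/p} = \delta, \\
  \|\Phi_\ell - f\|_{L^p([0,1]^d)}
  &\le (2d+1)\,\omega_p\!\left(f, N^{-\ell}\right)
  \le (2d+1)\,N^{-\ell}.
\end{align*}
```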
Axiom & Free-Parameter Ledger
free parameters (1)
- N
axioms (1)
- standard math: Standard properties of the L^p modulus of continuity on [0,1]^d
invented entities (1)
- mixed-activation architecture with intermediate readouts Φ_ℓ (no independent evidence)