arxiv: 2305.02657 · v5 · submitted 2023-05-04 · 📊 stat.ML · cs.LG

On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains

Pith reviewed 2026-05-24 08:37 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords eigenvalue decay rate · neural tangent kernel · general domains · minimax optimality · wide neural networks · RKHS · kernel regression · interpolation space

The pith

A strategy determines eigenvalue decay rates for neural tangent kernels and related functions on arbitrary domains rather than the sphere.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to compute how eigenvalues decay for a broad class of kernels defined on general domains. This class includes neural tangent kernels from networks of different depths and activations. With the decay rates in hand, the authors establish that wide neural network training dynamics uniformly match those of neural tangent kernel regression. They further prove that wide networks attain minimax optimal performance when the underlying function belongs to the interpolation space of the NTK reproducing kernel Hilbert space. The work additionally shows that overfitted neural networks fail to generalize.

Core claim

We provide a strategy to determine the eigenvalue decay rate of a large class of kernel functions defined on a general domain rather than the sphere. This class includes, but is not limited to, the neural tangent kernel associated with neural networks of different depths and various activation functions. After proving that the dynamics of training wide neural networks uniformly approximate those of neural tangent kernel regression on general domains, we further establish the minimax optimality of the wide neural network, provided that the underlying true function f lies in [H_NTK]^s, an interpolation space associated with the RKHS H_NTK of the NTK. We also show that the overfitted neural network cannot generalize well.

What carries the argument

The strategy for determining eigenvalue decay rates of the class of kernels on general domains, extending the spherical case to arbitrary domains.

If this is right

  • Training dynamics of wide neural networks uniformly approximate neural tangent kernel regression on general domains.
  • Wide neural networks achieve minimax optimality when the true function lies in the interpolation space [H_NTK]^s of the NTK RKHS.
  • Overfitted neural networks cannot generalize well.
  • The eigenvalue decay rate approach applies to the full class of kernels including NTKs of different depths and activations.
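
The first and third bullets both concern kernel regression trained by gradient flow: stopped early it acts as a regularized estimator, while run to convergence it interpolates the training data. A minimal sketch of that closed form follows, under stated assumptions: the Laplace kernel is a stand-in chosen here for simplicity and is not the paper's NTK, and the function names `laplace_kernel` and `kernel_gradient_flow` are hypothetical.

```python
import numpy as np

def laplace_kernel(A, B):
    """Stand-in kernel exp(-||a - b||). Assumption: NOT the paper's NTK."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists)

def kernel_gradient_flow(X_train, y_train, X_test, t, kernel=laplace_kernel):
    """Predict at X_test after kernel gradient flow run for time t.

    For f_0 = 0 and loss (1 / 2n) * sum_i (f(x_i) - y_i)^2 the flow has the
    closed form f_t(x) = k(x, X) K^{-1} (I - exp(-t K / n)) y.
    """
    n = X_train.shape[0]
    K = kernel(X_train, X_train)
    mu, U = np.linalg.eigh(K)            # K = U diag(mu) U^T
    mu = np.clip(mu, 1e-12, None)        # guard against round-off negatives
    # Spectral filter (1 - exp(-t * mu / n)) / mu, computed stably via expm1.
    filt = -np.expm1(-t * mu / n) / mu
    alpha = U @ (filt * (U.T @ y_train))
    return kernel(X_test, X_train) @ alpha

# Toy data on a grid in [-1, 1]; the target function and noise level are illustrative.
rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 30)[:, None]
y = np.sin(np.pi * X[:, 0]) + 0.3 * rng.standard_normal(30)

early = kernel_gradient_flow(X, y, X, t=5.0)   # early-stopped fit
late = kernel_gradient_flow(X, y, X, t=1e9)    # effectively the interpolant

print("max train residual, early stop:", np.abs(early - y).max())
print("max train residual, long run:  ", np.abs(late - y).max())
```

The long-run predictor reproduces the noisy labels on the training points, which is the overfitted regime the third bullet refers to. The sketch only exhibits the interpolation; it does not reproduce the paper's generalization bounds.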

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The domain-general strategy may allow similar decay-rate analysis for kernel methods outside neural-network settings.
  • Uniform approximation on arbitrary domains broadens the settings where NTK theory directly informs neural network behavior.
  • The failure of overfitted networks to generalize suggests examining regularization choices even when parameter count greatly exceeds sample size.

Load-bearing premise

The strategy for determining eigenvalue decay rates extends from the spherical case to arbitrary domains for the entire class of kernels that includes NTKs of varying depths and activations.

What would settle it

A concrete counterexample on a non-spherical domain where the computed eigenvalue decay rate deviates from the predicted rate, or where wide neural network training fails to uniformly approximate neural tangent kernel regression.

Figures

Figures reproduced from arXiv: 2305.02657 by Guhan Chen, Qian Lin, Yicheng Li, Zixiong Yu.

Figure 1: Eigenvalue decay of the NTK under the uniform distribution on [−1, 1]^d, where i is selected in [50, 200] and n = 1000. The dashed black line represents the log least-squares fit, and the decay rates r are reported.
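
The procedure the caption describes (draw n points uniformly from [−1, 1]^d, compute the eigenvalues of the normalized kernel matrix, and fit a line to log-eigenvalue against log-index for i in [50, 200]) can be sketched as below. This is an illustration under assumptions, not the authors' script: the Laplace kernel stands in for the NTK, d = 2 is an arbitrary choice, and `fit_decay_rate` is a hypothetical helper name.

```python
import numpy as np

def laplace_kernel(A, B):
    """Stand-in kernel exp(-||a - b||). Assumption: NOT the paper's NTK."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists)

def fit_decay_rate(X, kernel=laplace_kernel, i_min=50, i_max=200):
    """Estimate r in lambda_i ~ i^(-r) by a log-log least-squares fit."""
    n = X.shape[0]
    K = kernel(X, X)
    # Eigenvalues of K / n approximate those of the integral operator
    # with respect to the sampling distribution; sort in descending order.
    eigvals = np.linalg.eigvalsh(K / n)[::-1]
    idx = np.arange(i_min, i_max + 1)      # 1-based eigenvalue indices
    lam = eigvals[idx - 1]
    slope, _intercept = np.polyfit(np.log(idx), np.log(lam), deg=1)
    return -slope, eigvals

rng = np.random.default_rng(0)
n, d = 1000, 2                             # d = 2 is an illustrative choice
X = rng.uniform(-1.0, 1.0, size=(n, d))
rate, eigvals = fit_decay_rate(X)
print(f"estimated decay rate r = {rate:.2f}")
```

Restricting the fit to a middle band of indices is a practical choice: the leading eigenvalues are dominated by constants rather than the asymptotic rate, and the smallest ones are distorted by the finite sample size.
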
Original abstract

In this paper, we provide a strategy to determine the eigenvalue decay rate (EDR) of a large class of kernel functions defined on a general domain rather than $\mathbb S^{d}$. This class of kernel functions include but are not limited to the neural tangent kernel associated with neural networks with different depths and various activation functions. After proving that the dynamics of training the wide neural networks uniformly approximated that of the neural tangent kernel regression on general domains, we can further illustrate the minimax optimality of the wide neural network provided that the underground truth function $f\in [\mathcal H_{\mathrm{NTK}}]^{s}$, an interpolation space associated with the RKHS $\mathcal{H}_{\mathrm{NTK}}$ of NTK. We also showed that the overfitted neural network can not generalize well. We believe our approach for determining the EDR of kernels might be also of independent interests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a strategy for determining eigenvalue decay rates (EDR) of a class of kernels including neural tangent kernels (NTKs) of varying depths and activations, defined on general domains rather than the sphere. It claims this EDR result implies that wide neural network training dynamics uniformly approximate NTK regression on general domains, that wide NNs achieve minimax optimality when the target lies in the interpolation space [H_NTK]^s, and that overfitted networks fail to generalize.

Significance. A domain-general EDR strategy for NTK-type kernels would enable spectral analysis of kernel regression and NN generalization beyond spheres, strengthening the link between NTK theory and minimax rates via interpolation spaces. The independent-interest claim for the EDR method is plausible if it relies on standard integral-operator spectral theory rather than sphere-specific tools, but the manuscript provides no explicit comparison to existing results on compact domains.

major comments (2)
  1. [Section on EDR strategy and its application to general domains] The central extension of the EDR strategy from the sphere to arbitrary domains is load-bearing for both the uniform approximation claim and the minimax optimality statement, yet the manuscript does not exhibit a domain-general proof that replaces spherical-harmonic or zonal-harmonic expansions with the spectral theory of the integral operator on L^2(Ω) for compact Ω with boundary regularity; without this, the decay rates used to define [H_NTK]^s are not guaranteed to hold uniformly.
  2. [Proof of uniform approximation of training dynamics] The claim that wide NN training dynamics uniformly approximate NTK regression on general domains (used to reach the optimality result) rests on the EDR rates; if those rates are only established under sphere-specific assumptions, the approximation statement fails to extend as stated.
minor comments (2)
  1. [Definition of interpolation spaces] Notation for the interpolation spaces [H_NTK]^s should be defined explicitly with reference to the RKHS norm and the eigenvalue decay, rather than left implicit.
  2. [Section on overfitted networks] The statement that overfitted networks cannot generalize well would benefit from a precise quantitative bound linking the EDR to the generalization gap.
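
For reference, the explicit definition that the first minor comment asks for is usually written as below. This is the standard textbook construction, stated here as an assumption about what the manuscript intends: $k$ is a continuous positive-definite kernel on a domain with measure $\mu$, $(\lambda_i, e_i)$ are the eigenpairs of its integral operator on $L^2(\mu)$, and $\beta$ denotes the eigenvalue decay rate.

```latex
% Mercer decomposition of the kernel
k(x, y) = \sum_{i \ge 1} \lambda_i \, e_i(x) \, e_i(y),
\qquad \lambda_1 \ge \lambda_2 \ge \dots > 0 .

% Eigenvalue decay rate (EDR) \beta
\lambda_i \asymp i^{-\beta} .

% Interpolation space of order s \ge 0 and its norm
[\mathcal{H}]^{s}
  = \Big\{ \sum_{i \ge 1} a_i \lambda_i^{s/2} e_i \;:\; \sum_{i \ge 1} a_i^2 < \infty \Big\},
\qquad
\Big\| \sum_{i \ge 1} a_i \lambda_i^{s/2} e_i \Big\|_{[\mathcal{H}]^{s}}
  = \Big( \sum_{i \ge 1} a_i^2 \Big)^{1/2} .
```

With $s = 1$ this recovers the RKHS $\mathcal{H}$ itself, and larger $s$ corresponds to smoother targets. Under these two conditions the classical kernel-regression literature gives the minimax rate $n^{-s\beta/(s\beta+1)}$, which is presumably the benchmark against which the optimality claim is measured.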

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. Below we respond point-by-point to the major comments, clarifying the domain-general nature of the EDR strategy.

Point-by-point responses
  1. Referee: The central extension of the EDR strategy from the sphere to arbitrary domains is load-bearing for both the uniform approximation claim and the minimax optimality statement, yet the manuscript does not exhibit a domain-general proof that replaces spherical-harmonic or zonal-harmonic expansions with the spectral theory of the integral operator on L^2(Ω) for compact Ω with boundary regularity; without this, the decay rates used to define [H_NTK]^s are not guaranteed to hold uniformly.

    Authors: The EDR strategy in the manuscript is formulated directly via the spectral theory of the integral operator on L^2(Ω) for a general compact domain Ω (with the stated boundary regularity). No spherical or zonal harmonic expansions are employed; the decay rates follow from the assumed smoothness of the kernel and standard results on the eigenvalues of integral operators with continuous kernels. The relevant arguments appear in Sections 3–4 and apply uniformly to the class of kernels considered, including NTKs of varying depths. We will add an explicit remark in the introduction contrasting the approach with sphere-specific techniques to make this generality clearer. revision: partial

  2. Referee: The claim that wide NN training dynamics uniformly approximate NTK regression on general domains (used to reach the optimality result) rests on the EDR rates; if those rates are only established under sphere-specific assumptions, the approximation statement fails to extend as stated.

    Authors: Because the EDR rates are obtained from the domain-general integral-operator analysis described above, the subsequent uniform approximation of wide-NN training dynamics by NTK regression (Theorem 5.1) likewise holds on general domains. The approximation argument relies on the eigenvalue decay to control the RKHS norm and does not invoke sphere geometry. revision: no

Circularity Check

0 steps flagged

No circularity detected; derivation relies on claimed new EDR strategy for general domains.

full rationale

The abstract and provided excerpts describe a new strategy for eigenvalue decay rates on arbitrary domains (distinct from the sphere), followed by an approximation result for wide NN dynamics to NTK regression and a minimax optimality claim under an interpolation-space assumption. No equations, self-citations, or fitted quantities are exhibited that reduce any central prediction or uniqueness claim to its own inputs by construction. The extension to general domains is presented as the paper's contribution rather than imported via self-reference or ansatz smuggling. This is the expected non-finding for a paper whose core technical step is an independent methodological extension.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Ledger constructed from abstract claims only. The main unverified premise is that a single strategy covers the stated class of kernels including NTKs of arbitrary depth and activation on arbitrary domains.

axioms (1)
  • domain assumption: The class of kernels includes NTKs for networks of different depths and activations, and the EDR strategy applies uniformly to this class on general domains.
    Invoked when extending spherical results to general domains and claiming uniform approximation of NN training dynamics.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Dimensional Kernel Ridge Regression: Extending to Product Kernels

    stat.ML 2026-05 unverdicted novelty 7.0

    Extends high-dimensional KRR to product kernels, proving convergence rates that recover minimax optimality for source condition s ≤ 1, saturation for s > 1, and multiple-descent phenomena with respect to sample size n.
