pith. sign in

arxiv: 2512.20562 · v2 · submitted 2025-12-23 · 📊 stat.ML · cs.LG· math.OC

Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

Pith reviewed 2026-05-16 19:59 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OC
keywords shallow neural networksspherical polynomialschannel attentionfeature learninggradient descentsample complexityminimax optimalitynonparametric regression
0
0 comments X

The pith

Two-layer neural networks with learnable channel attention achieve the minimax optimal sample complexity n ~ d to the ell zero over epsilon for low-degree spherical polynomials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a carefully designed two-layer neural network with channel attention learns low-degree spherical polynomials of fixed degree ell zero on the unit sphere in d dimensions using the lowest possible number of samples. A first stage identifies the correct harmonic degree via a learnable channel selection process from L greater than or equal to ell zero channels, after which vanilla gradient descent trains the second layer. This yields regression risk of order d to the ell zero over n, which is sharp and matches the minimax rate for kernels of rank Theta of d to the ell zero. The bound improves on earlier results that carried extra logarithmic or inverse-epsilon factors and holds with finite network width.

Core claim

For any regression risk epsilon in (0,1), a two-layer NN with channel attention and finite width trained by vanilla GD requires sample complexity n as Theta of d to the ell zero over epsilon with high probability and renders nonparametric regression risk of order Theta of d to the ell zero over n that is minimax optimal, via a two-stage process in which a provable learnable channel selection algorithm identifies the ground-truth degree ell zero from L greater than or equal to ell zero channels before standard GD trains the second layer.

What carries the argument

The learnable channel selection algorithm, a harmonic-degree selection process that identifies the ground-truth channel number ell zero from the first-layer activation channels before second-layer training.

If this is right

  • The achieved sample complexity n as Theta of d to the ell zero over epsilon is the lowest possible with high probability.
  • The regression risk bound of Theta of d to the ell zero over n is sharp and matches the minimax lower bound.
  • The result holds for finite-width networks trained with vanilla gradient descent.
  • Feature learning through the channel attention step is what removes the extra log d and epsilon to the minus two factors present in prior bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar explicit selection mechanisms may allow overparameterized networks to reach optimality in other high-dimensional nonparametric regression tasks.
  • The separation into a selection stage followed by gradient descent suggests that pure end-to-end training without selection may not suffice for minimax rates on polynomial classes.
  • The approach could be tested on related function classes such as low-degree polynomials on the hypercube or other compact manifolds.

Load-bearing premise

The network must contain a provable learnable channel selection algorithm that correctly identifies the ground-truth degree ell zero from the available channels before the second stage of gradient descent.

What would settle it

An experiment showing that removing the channel selection step forces the sample complexity back to Theta of d to the ell zero times max of epsilon to the minus two or log d, or produces risk strictly worse than Theta of d to the ell zero over n.

read the original abstract

We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\RR^d$ by training an over-parameterized two-layer neural network (NN) with channel attention in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\eps \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width trained by the vanilla gradient descent (GD) requires the lowest sample complexity of $n \asymp \Theta(d^{\ell_0}/\eps)$ with high probability, in contrast with the representative sample complexity $\Theta\pth{d^{\ell_0} \max\set{\eps^{-2},\log d}}$, where $n$ is the training data size. Moreover, such sample complexity is not improvable since the trained network renders a sharp rate of the nonparametric regression risk of the order $\Theta(d^{\ell_0}/{n})$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GD is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provable learnable channel selection algorithm, as a learnable harmonic-degree selection process, identifies the ground truth channel number in the target function, $\ell_0$, from $L \ge \ell_0$ channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound is obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies learning a low-degree spherical polynomial of fixed degree ℓ₀ = Θ(1) on the unit sphere in ℝ^d via an over-parameterized two-layer neural network equipped with a learnable channel attention mechanism. It proposes a two-stage procedure: (1) a provable channel-selection step that identifies the ground-truth degree ℓ₀ from L ≥ ℓ₀ channels, followed by (2) vanilla gradient descent on the second-layer weights using only the selected channels. The central claim is that this procedure achieves sample complexity n ≍ Θ(d^{ℓ₀}/ε) for any regression risk ε ∈ (0,1) with high probability, yielding a sharp nonparametric risk of Θ(d^{ℓ₀}/n) that matches the minimax rate for a kernel of rank Θ(d^{ℓ₀}).

Significance. If the proofs are complete, the result would be a meaningful advance in the theory of feature learning for shallow networks: it supplies the first finite-width, vanilla-GD guarantee that attains the minimax rate for this function class, explicitly separating the benefit of learnable channel attention from kernel methods that incur extra max(ε^{-2}, log d) factors. The explicit two-stage construction and the matching lower bound are strengths that could influence subsequent work on attention-based feature selection.

major comments (2)
  1. [§3.2 and Theorem 4.1] §3.2 (Channel Selection Analysis) and Theorem 4.1: The headline sample-complexity claim n ≍ Θ(d^{ℓ₀}/ε) requires that the learnable channel-selection step succeeds with probability 1−o(1) at exactly this n. The concentration argument for the attention weights (Eq. (8) and the subsequent union bound over L channels) appears to produce a failure probability that decays only after an extra Ω(log(d/δ)) factor in n; setting δ = Θ(ε) to control the final risk therefore re-introduces a logarithmic term that the abstract contrasts against the prior Θ(d^{ℓ₀} max{ε^{-2}, log d}) bound.
  2. [§4.3] §4.3 (Risk Decomposition): The sharp rate Θ(d^{ℓ₀}/n) for the trained network is obtained by showing that, conditional on correct channel selection, the second-stage GD error is O(d^{ℓ₀}/n). However, the decomposition does not quantify the probability that selection fails on a constant fraction of draws; if this probability is Ω(ε), the unconditional risk bound reverts to the slower rate the paper claims to improve upon.
minor comments (2)
  1. [§2] §2 (Notation): The definition of the spherical-harmonic basis and the precise parameterization of the channel attention weights (Eq. (3)) would benefit from an explicit statement of the normalization constants and the range of the learnable scalars.
  2. [Figure 1 and Algorithm 1] Figure 1 and Algorithm 1: The schematic of the two-stage procedure is clear, but the caption should state the precise dependence of the width m and the number of channels L on d and ℓ₀ that is used in the theorems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below, providing clarifications on the concentration and risk bounds. We will incorporate explicit remarks into the revised manuscript.

read point-by-point responses
  1. Referee: [§3.2 and Theorem 4.1] §3.2 (Channel Selection Analysis) and Theorem 4.1: The headline sample-complexity claim n ≍ Θ(d^{ℓ₀}/ε) requires that the learnable channel-selection step succeeds with probability 1−o(1) at exactly this n. The concentration argument for the attention weights (Eq. (8) and the subsequent union bound over L channels) appears to produce a failure probability that decays only after an extra Ω(log(d/δ)) factor in n; setting δ = Θ(ε) to control the final risk therefore re-introduces a logarithmic term that the abstract contrasts against the prior Θ(d^{ℓ₀} max{ε^{-2}, log d}) bound.

    Authors: We appreciate the referee highlighting this detail. The concentration inequality for the attention weights (Eq. (8)) is sub-Gaussian and yields a per-channel failure probability of exp(−Ω(n / d^{ℓ₀})). Substituting the claimed n ≍ Θ(d^{ℓ₀}/ε) produces failure probability exp(−Ω(1/ε)), which is o(ε^k) for any fixed k. Because ℓ₀ = Θ(1) we may take L = O(1) (or at most polylog(d) without changing the leading term), so the union bound over L channels remains exponentially small in 1/ε and does not require any extra logarithmic factor in n. We will add a short paragraph after Eq. (8) making this exponential decay explicit and confirming that the sample complexity remains Θ(d^{ℓ₀}/ε) with probability 1−o(1). revision: partial

  2. Referee: [§4.3] §4.3 (Risk Decomposition): The sharp rate Θ(d^{ℓ₀}/n) for the trained network is obtained by showing that, conditional on correct channel selection, the second-stage GD error is O(d^{ℓ₀}/n). However, the decomposition does not quantify the probability that selection fails on a constant fraction of draws; if this probability is Ω(ε), the unconditional risk bound reverts to the slower rate the paper claims to improve upon.

    Authors: We thank the referee for this observation. The failure probability of the channel-selection step is exp(−Ω(1/ε)) as established in §3.2. Consequently, the contribution of any failure event to the unconditional risk is at most exp(−Ω(1/ε)) ⋅ O(1) = o(ε). Adding this term to the conditional risk O(d^{ℓ₀}/n) = O(ε) does not change the leading Θ(d^{ℓ₀}/n) rate. We will revise the risk decomposition in §4.3 to state the unconditional bound explicitly, thereby confirming that the claimed minimax-optimal rate holds with high probability. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with no reduction to inputs by construction

full rationale

The paper presents a two-stage procedure consisting of a learnable channel selection algorithm that identifies ell_0 from L channels, followed by GD on the second layer. The claimed sample complexity n ~ Theta(d^{ell_0}/eps) and the resulting nonparametric risk bound Theta(d^{ell_0}/n) are stated as consequences of this procedure being provable at the given n, with the minimax optimality comparison drawn externally to kernel methods of rank Theta(d^{ell_0}). No equation or step reduces the final rate to a fitted parameter, a self-defined quantity, or a self-citation chain by construction; the selection success is asserted as an internal proof obligation rather than an assumption that tautologically encodes the target bound. The derivation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of a provable learnable channel-selection algorithm that correctly identifies ell_0 from L channels and on standard convergence assumptions for gradient descent on the selected channels; no explicit free parameters or new invented entities are named in the abstract.

axioms (1)
  • domain assumption Gradient descent on the second-layer weights after channel selection converges to the minimax rate under the stated over-parameterization
    Invoked in the description of the two-stage training procedure
invented entities (1)
  • learnable channel attention mechanism no independent evidence
    purpose: To perform harmonic-degree selection that identifies the ground-truth degree ell_0
    Introduced as the key architectural component enabling the improved sample complexity

pith-pipeline@v0.9.0 · 5673 in / 1450 out tokens · 18943 ms · 2026-05-16T19:59:49.589639+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

  1. [1]

    Deep learning,

    Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, pp. 436–444, 2015

  2. [2]

    Gradient descent provably optimizes over- parameterized neural networks,

    S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over- parameterized neural networks,” inInternational Conference on Learning Representations, 2019

  3. [3]

    A convergence theory for deep learning via over- parameterization,

    Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over- parameterization,” inInternational Conference on Machine Learning, ser. Proceedings of Ma- chine Learning Research, vol. 97. PMLR, 2019, pp. 242–252

  4. [4]

    Gradient descent finds global minima of deep neural networks,

    S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 1675–1685

  5. [5]

    Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,

    S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang, “Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 322–332

  6. [6]

    An improved analysis of training over-parameterized deep neural net- works,

    D. Zou and Q. Gu, “An improved analysis of training over-parameterized deep neural net- works,” inAdvances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 2053–2062

  7. [7]

    On learning over-parameterized neural networks: A functional approxima- tion perspective,

    L. Su and P. Yang, “On learning over-parameterized neural networks: A functional approxima- tion perspective,” inAdvances in Neural Information Processing Systems, 2019, pp. 2637–2646

  8. [8]

    Neural tangent kernel: Convergence and generalization in neural networks,

    A. Jacot, C. Hongler, and F. Gabriel, “Neural tangent kernel: Convergence and generalization in neural networks,” inAdvances in Neural Information Processing Systems, S. Bengio, H. M. 40 Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 8580– 8589

  9. [9]

    Tensor programs IV: feature learning in infinite-width neural networks,

    G. Yang and E. J. Hu, “Tensor programs IV: feature learning in infinite-width neural networks,” inProceedings of the 38th International Conference on Machine Learning, ICML 2021, 18- 24 July 2021, Virtual Event, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 2021, pp. 11727–11737

  10. [10]

    Generalization bounds of stochastic gradient descent for wide and deep neural networks,

    Y. Cao and Q. Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” inAdvances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 10835–10845

  11. [11]

    Linearizedtwo-layersneuralnetworks in high dimension,

    B.Ghorbani, S.Mei, T.Misiakiewicz, andA.Montanari, “Linearizedtwo-layersneuralnetworks in high dimension,”Ann. Statist., vol. 49, no. 2, pp. 1029 – 1054, 2021

  12. [12]

    On the spectral bias of neural networks,

    N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun 2019, pp. 5301–5310

  13. [13]

    Towards understanding the spectral bias of deep learning,

    Y. Cao, Z. Fang, Y. Wu, D. Zhou, and Q. Gu, “Towards understanding the spectral bias of deep learning,” inInternational Joint Conference on Artificial Intelligence, Z. Zhou, Ed. ijcai.org, 2021, pp. 2205–2211

  14. [14]

    The spectral biasof polynomial neural networks,

    M.Choraria, L.T. Dadi, G.Chrysos, J.Mairal, andV. Cevher, “The spectral biasof polynomial neural networks,” inInternational Conference on Learning Representations. OpenReview.net, 2022

  15. [15]

    Beyond linearization: On quadratic and higher-order approximation of wide neural networks,

    Y. Bai and J. D. Lee, “Beyond linearization: On quadratic and higher-order approximation of wide neural networks,” inInternational Conference on Learning Representations. OpenRe- view.net, 2020

  16. [16]

    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials,

    E. Nichani, Y. Bai, and J. D. Lee, “Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mo- ha...

  17. [17]

    Neural networks can learn representations with gradient descent,

    A. Damian, J. D. Lee, and M. Soltanolkotabi, “Neural networks can learn representations with gradient descent,” inConference on Learning Theory, 2-5 July 2022, London, UK, ser. Proceedings of Machine Learning Research, P. Loh and M. Raginsky, Eds., vol. 178. PMLR, 2022, pp. 5413–5452

  18. [18]

    Mean-field analysis on two-layer neural networks from a kernel per- spective,

    S. Takakura and T. Suzuki, “Mean-field analysis on two-layer neural networks from a kernel per- spective,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

  19. [19]

    Regularization matters: A nonparametric perspective on overparametrized neural network,

    T. Hu, W. Wang, C. Lin, and G. Cheng, “Regularization matters: A nonparametric perspective on overparametrized neural network,” inInternational Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Banerjee and K. Fukumizu, Eds., vol. 130. PMLR, 2021, pp. 829–837. 41

  20. [20]

    A non-parametric regression viewpoint : Generalization of over- parametrized deep RELU network under noisy observations,

    N. Suh, H. Ko, and X. Huo, “A non-parametric regression viewpoint : Generalization of over- parametrized deep RELU network under noisy observations,” inThe Tenth International Con- ference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenRe- view.net, 2022

  21. [21]

    On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains,

    Y. Li, Z. Yu, G. Chen, and Q. Lin, “On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains,”Journal of Machine Learning Research, vol. 25, no. 82, pp. 1–47, 2024

  22. [22]

    Gradient descent finds over-parameterized neural networks with sharp generalization for nonparametric regression,

    Y. Yang and P. Li, “Gradient descent finds over-parameterized neural networks with sharp generalization for nonparametric regression,”arXiv preprint arXiv:2411.02904, 2024. [Online]. Available: https://arxiv.org/abs/2411.02904

  23. [23]

    Sharp generalization for nonparametric regression by over-parameterized neural net- works: A distribution-free analysis in spherical covariate,

    Y. Yang, “Sharp generalization for nonparametric regression by over-parameterized neural net- works: A distribution-free analysis in spherical covariate,” inInternational Conference on Ma- chine Learning (ICML), 2025

  24. [24]

    Minimax-optimal rates for sparse additive models over kernel classes via convex programming,

    G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax-optimal rates for sparse additive models over kernel classes via convex programming,”J. Mach. Learn. Res., vol. 13, pp. 389–427, 2012

  25. [25]

    Information theoretic bounds for compressed sensing,

    S. Aeron, V. Saligrama, and M. Zhao, “Information theoretic bounds for compressed sensing,” IEEE Transactions on Information Theory, vol. 56, no. 10, pp. 5111–5130, 2010

  26. [26]

    Additive Regression and Other Nonparametric Models,

    C. J. Stone, “Additive Regression and Other Nonparametric Models,”Ann. Statist., vol. 13, no. 2, pp. 689 – 705, 1985

  27. [27]

    Information-theoreticdeterminationofminimaxratesofconvergence,

    Y.YangandA.Barron, “Information-theoreticdeterminationofminimaxratesofconvergence,” Ann. Statist., vol. 27, no. 5, pp. 1564 – 1599, 1999

  28. [28]

    Early stopping and non-parametric regression: an optimal data-dependent stopping rule,

    G. Raskutti, M. J. Wainwright, and B. Yu, “Early stopping and non-parametric regression: an optimal data-dependent stopping rule,”J. Mach. Learn. Res., vol. 15, no. 1, pp. 335–366, 2014

  29. [29]

    Minimax optimal rates of estimation in high dimensional additive models,

    M. Yuan and D.-X. Zhou, “Minimax optimal rates of estimation in high dimensional additive models,”Ann. Statist., vol. 44, no. 6, pp. 2564 – 2593, 2016

  30. [30]

    Local rademacher complexities,

    P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,”Ann. Statist., vol. 33, no. 4, pp. 1497–1537, 08 2005

  31. [31]

    Local rademacher complexities and oracle inequalities in risk minimization,

    V. Koltchinskii, “Local rademacher complexities and oracle inequalities in risk minimization,” Ann. Statist., vol. 34, no. 6, pp. 2593–2656, 12 2006

  32. [32]

    Geometric parameters of kernel machines,

    S. Mendelson, “Geometric parameters of kernel machines,” inConference on Computational Learning Theory, ser. Lecture Notes in Computer Science, J. Kivinen and R. H. Sloan, Eds., vol. 2375. Springer, 2002, pp. 29–43

  33. [33]

    Boston, MA: Birkhäuser Boston, 1992, pp

    I.Pinelis,An Approach to Inequalities for the Distributions of Infinite-Dimensional Martingales. Boston, MA: Birkhäuser Boston, 1992, pp. 128–134

  34. [34]

    Ledoux,Probability in Banach Spaces [electronic resource] : Isoperimetry and Processes / by Michel Ledoux, Michel Talagrand., 1st ed., ser

    M. Ledoux,Probability in Banach Spaces [electronic resource] : Isoperimetry and Processes / by Michel Ledoux, Michel Talagrand., 1st ed., ser. Classics in Mathematics. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991

  35. [35]

    Chihara,An Introduction to Orthogonal Polynomials, ser

    T. Chihara,An Introduction to Orthogonal Polynomials, ser. Dover Books on Mathematics. Dover Publications, 2011. 42

  36. [36]

    Efthimiou and C

    C. Efthimiou and C. Frye,Spherical Harmonics in p Dimensions. World Scientific Co., 2014

  37. [37]

    Szegő,Orthogonal Polynomials, ser

    G. Szegő,Orthogonal Polynomials, ser. American Math. Soc: Colloquium publ. Amer. Math. Soc., 1975

  38. [38]

    Breaking the curse of dimensionality with convex neural networks,

    F. R. Bach, “Breaking the curse of dimensionality with convex neural networks,”J. Mach. Learn. Res., vol. 18, pp. 19:1–19:53, 2017

  39. [39]

    On the inductive bias of neural tangent kernels,

    A. Bietti and J. Mairal, “On the inductive bias of neural tangent kernels,” inAdvances in Neural Information Processing Systems, H.M.Wallach, H.Larochelle, A.Beygelzimer, F.d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 12873–12884

  40. [40]

    Basics of harmonic polynomials and spherical functions,

    N. V. Krylov, “Basics of harmonic polynomials and spherical functions,” Tech. Rep. [Online]. Available: https://www-users.cse.umn.edu/~nkrylov/Moscow_2019_Sphrcal.pdf

  41. [41]

    A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions are not Necessarily Symmetric,

    F. T. Wright, “A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions are not Necessarily Symmetric,”Ann. Probab., vol. 1, no. 6, pp. 1068 – 1070, 1973

  42. [42]

    An introduction to matrix concentration inequalities,

    J. A. Tropp, “An introduction to matrix concentration inequalities,”Foundations and Trends® in Machine Learning, vol. 8, no. 1-2, pp. 1–230, 2015

  43. [43]

    Dual attention network for scene segmentation,

    J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” inIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 3146–3154

  44. [44]

    Eca-net: Efficient channel attention for deep convolutional neural networks,

    Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2020, pp. 11531–11539

  45. [45]

    Xcit: Cross-covariance image transformers,

    A. Ali, H. Touvron, M. Caron, P. Bojanowski, M. Douze, A. Joulin, I. Laptev, N. Neverova, G. Synnaeve, J. Verbeek, and H. Jégou, “Xcit: Cross-covariance image transformers,” inAd- vances in Neural Information Processing Systems 34: Annual Conference on Neural Infor- mation Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A....

  46. [46]

    Understanding matrix function normalizations incovariancepoolingthroughthelensofriemanniangeometry,

    Z. Chen, Y. Song, X. Wu, G. Liu, and N. Sebe, “Understanding matrix function normalizations incovariancepoolingthroughthelensofriemanniangeometry,” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenRe- view.net, 2025

  47. [47]

    Why approximate matrix square root outperforms accurate SVD in global covariance pooling?

    Y. Song, N. Sebe, and W. Wang, “Why approximate matrix square root outperforms accurate SVD in global covariance pooling?” in2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 2021, pp. 1095–1103

  48. [48]

    Towards a deeper understanding of global covariance pooling in deep learning: An optimization perspective,

    Q. Wang, Z. Zhang, M. Gao, J. Xie, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Towards a deeper understanding of global covariance pooling in deep learning: An optimization perspective,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 12, pp. 15802–15819, 2023. 43

  49. [49]

    Rethinking attention with performers,

    K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller, “Rethinking attention with performers,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

  50. [50]

    Random feature attention,

    H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, “Random feature attention,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

  51. [51]

    Efficient attention via control variates,

    L. Zheng, J. Yuan, C. Wang, and L. Kong, “Efficient attention via control variates,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  52. [52]

    Infinite attention: NNGP and NTK for deep attention networks,

    J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak, “Infinite attention: NNGP and NTK for deep attention networks,” inProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 2020, pp. 4376–4386

  53. [53]

    Transformers are minimax optimal nonparametric in- context learners,

    J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in- context learners,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 44