Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

Yingzhen Yang

arxiv: 2512.20562 · v2 · submitted 2025-12-23 · 📊 stat.ML · cs.LG· math.OC

Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

Yingzhen Yang This is my paper

Pith reviewed 2026-05-16 19:59 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OC

keywords shallow neural networksspherical polynomialschannel attentionfeature learninggradient descentsample complexityminimax optimalitynonparametric regression

0 comments

The pith

Two-layer neural networks with learnable channel attention achieve the minimax optimal sample complexity n ~ d to the ell zero over epsilon for low-degree spherical polynomials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a carefully designed two-layer neural network with channel attention learns low-degree spherical polynomials of fixed degree ell zero on the unit sphere in d dimensions using the lowest possible number of samples. A first stage identifies the correct harmonic degree via a learnable channel selection process from L greater than or equal to ell zero channels, after which vanilla gradient descent trains the second layer. This yields regression risk of order d to the ell zero over n, which is sharp and matches the minimax rate for kernels of rank Theta of d to the ell zero. The bound improves on earlier results that carried extra logarithmic or inverse-epsilon factors and holds with finite network width.

Core claim

For any regression risk epsilon in (0,1), a two-layer NN with channel attention and finite width trained by vanilla GD requires sample complexity n as Theta of d to the ell zero over epsilon with high probability and renders nonparametric regression risk of order Theta of d to the ell zero over n that is minimax optimal, via a two-stage process in which a provable learnable channel selection algorithm identifies the ground-truth degree ell zero from L greater than or equal to ell zero channels before standard GD trains the second layer.

What carries the argument

The learnable channel selection algorithm, a harmonic-degree selection process that identifies the ground-truth channel number ell zero from the first-layer activation channels before second-layer training.

If this is right

The achieved sample complexity n as Theta of d to the ell zero over epsilon is the lowest possible with high probability.
The regression risk bound of Theta of d to the ell zero over n is sharp and matches the minimax lower bound.
The result holds for finite-width networks trained with vanilla gradient descent.
Feature learning through the channel attention step is what removes the extra log d and epsilon to the minus two factors present in prior bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar explicit selection mechanisms may allow overparameterized networks to reach optimality in other high-dimensional nonparametric regression tasks.
The separation into a selection stage followed by gradient descent suggests that pure end-to-end training without selection may not suffice for minimax rates on polynomial classes.
The approach could be tested on related function classes such as low-degree polynomials on the hypercube or other compact manifolds.

Load-bearing premise

The network must contain a provable learnable channel selection algorithm that correctly identifies the ground-truth degree ell zero from the available channels before the second stage of gradient descent.

What would settle it

An experiment showing that removing the channel selection step forces the sample complexity back to Theta of d to the ell zero times max of epsilon to the minus two or log d, or produces risk strictly worse than Theta of d to the ell zero over n.

read the original abstract

We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\RR^d$ by training an over-parameterized two-layer neural network (NN) with channel attention in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\eps \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width trained by the vanilla gradient descent (GD) requires the lowest sample complexity of $n \asymp \Theta(d^{\ell_0}/\eps)$ with high probability, in contrast with the representative sample complexity $\Theta\pth{d^{\ell_0} \max\set{\eps^{-2},\log d}}$, where $n$ is the training data size. Moreover, such sample complexity is not improvable since the trained network renders a sharp rate of the nonparametric regression risk of the order $\Theta(d^{\ell_0}/{n})$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GD is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provable learnable channel selection algorithm, as a learnable harmonic-degree selection process, identifies the ground truth channel number in the target function, $\ell_0$, from $L \ge \ell_0$ channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound is obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first finite-width proof that a two-layer net with learnable channel attention reaches the minimax rate Theta(d^ell0 / n) for low-degree spherical polynomials via a two-stage selection-plus-GD procedure.

read the letter

The main thing to know is that this work claims a finite-width two-layer network with explicit channel attention can learn low-degree spherical polynomials at the optimal sample complexity n ~ Theta(d^ell0 / eps) and achieve the matching risk bound Theta(d^ell0 / n). That rate is minimax optimal for kernels of rank Theta(d^ell0), and the paper positions it as better than earlier finite-width bounds that carried extra max(1/eps^2, log d) factors. The construction uses a two-stage process: a learnable selection step that identifies the ground-truth degree ell0 from L >= ell0 channels, followed by vanilla GD on the second layer using only the selected channels. If the selection step holds with high probability at the stated sample size, the result is genuinely new for feature-learning analyses in this setting. The paper does a clean job spelling out the contrast with prior sample complexities and tying the achieved rate to the kernel lower bound. The spherical-harmonics setup and the explicit attention mechanism are handled directly, which makes the argument easier to follow than many abstract feature-learning claims. The soft spot sits in the channel-selection analysis. The abstract asserts it is provable and succeeds at the target n, but the stress-test concern is reasonable: if the high-probability guarantee for correct selection carries hidden logarithmic or 1/eps factors, or requires stronger initialization assumptions than stated, the overall sample complexity reverts to the worse rate the paper contrasts against. The GD convergence part on the selected features looks standard and should be fine once selection succeeds, but the selection step is load-bearing. The math and citation pattern appear solid on the surface, with no obvious circularity in the rate comparison. This paper is for people working on nonparametric regression theory, spherical harmonics, or the gap between infinite-width kernels and finite-width feature learning. A reader who cares about concrete attention mechanisms in high-dimensional regression will find the explicit construction useful. It deserves serious referee time because the claimed improvement is sharp and the two-stage procedure is spelled out clearly enough to check.

Referee Report

2 major / 2 minor

Summary. The manuscript studies learning a low-degree spherical polynomial of fixed degree ℓ₀ = Θ(1) on the unit sphere in ℝ^d via an over-parameterized two-layer neural network equipped with a learnable channel attention mechanism. It proposes a two-stage procedure: (1) a provable channel-selection step that identifies the ground-truth degree ℓ₀ from L ≥ ℓ₀ channels, followed by (2) vanilla gradient descent on the second-layer weights using only the selected channels. The central claim is that this procedure achieves sample complexity n ≍ Θ(d^{ℓ₀}/ε) for any regression risk ε ∈ (0,1) with high probability, yielding a sharp nonparametric risk of Θ(d^{ℓ₀}/n) that matches the minimax rate for a kernel of rank Θ(d^{ℓ₀}).

Significance. If the proofs are complete, the result would be a meaningful advance in the theory of feature learning for shallow networks: it supplies the first finite-width, vanilla-GD guarantee that attains the minimax rate for this function class, explicitly separating the benefit of learnable channel attention from kernel methods that incur extra max(ε^{-2}, log d) factors. The explicit two-stage construction and the matching lower bound are strengths that could influence subsequent work on attention-based feature selection.

major comments (2)

[§3.2 and Theorem 4.1] §3.2 (Channel Selection Analysis) and Theorem 4.1: The headline sample-complexity claim n ≍ Θ(d^{ℓ₀}/ε) requires that the learnable channel-selection step succeeds with probability 1−o(1) at exactly this n. The concentration argument for the attention weights (Eq. (8) and the subsequent union bound over L channels) appears to produce a failure probability that decays only after an extra Ω(log(d/δ)) factor in n; setting δ = Θ(ε) to control the final risk therefore re-introduces a logarithmic term that the abstract contrasts against the prior Θ(d^{ℓ₀} max{ε^{-2}, log d}) bound.
[§4.3] §4.3 (Risk Decomposition): The sharp rate Θ(d^{ℓ₀}/n) for the trained network is obtained by showing that, conditional on correct channel selection, the second-stage GD error is O(d^{ℓ₀}/n). However, the decomposition does not quantify the probability that selection fails on a constant fraction of draws; if this probability is Ω(ε), the unconditional risk bound reverts to the slower rate the paper claims to improve upon.

minor comments (2)

[§2] §2 (Notation): The definition of the spherical-harmonic basis and the precise parameterization of the channel attention weights (Eq. (3)) would benefit from an explicit statement of the normalization constants and the range of the learnable scalars.
[Figure 1 and Algorithm 1] Figure 1 and Algorithm 1: The schematic of the two-stage procedure is clear, but the caption should state the precise dependence of the width m and the number of channels L on d and ℓ₀ that is used in the theorems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below, providing clarifications on the concentration and risk bounds. We will incorporate explicit remarks into the revised manuscript.

read point-by-point responses

Referee: [§3.2 and Theorem 4.1] §3.2 (Channel Selection Analysis) and Theorem 4.1: The headline sample-complexity claim n ≍ Θ(d^{ℓ₀}/ε) requires that the learnable channel-selection step succeeds with probability 1−o(1) at exactly this n. The concentration argument for the attention weights (Eq. (8) and the subsequent union bound over L channels) appears to produce a failure probability that decays only after an extra Ω(log(d/δ)) factor in n; setting δ = Θ(ε) to control the final risk therefore re-introduces a logarithmic term that the abstract contrasts against the prior Θ(d^{ℓ₀} max{ε^{-2}, log d}) bound.

Authors: We appreciate the referee highlighting this detail. The concentration inequality for the attention weights (Eq. (8)) is sub-Gaussian and yields a per-channel failure probability of exp(−Ω(n / d^{ℓ₀})). Substituting the claimed n ≍ Θ(d^{ℓ₀}/ε) produces failure probability exp(−Ω(1/ε)), which is o(ε^k) for any fixed k. Because ℓ₀ = Θ(1) we may take L = O(1) (or at most polylog(d) without changing the leading term), so the union bound over L channels remains exponentially small in 1/ε and does not require any extra logarithmic factor in n. We will add a short paragraph after Eq. (8) making this exponential decay explicit and confirming that the sample complexity remains Θ(d^{ℓ₀}/ε) with probability 1−o(1). revision: partial
Referee: [§4.3] §4.3 (Risk Decomposition): The sharp rate Θ(d^{ℓ₀}/n) for the trained network is obtained by showing that, conditional on correct channel selection, the second-stage GD error is O(d^{ℓ₀}/n). However, the decomposition does not quantify the probability that selection fails on a constant fraction of draws; if this probability is Ω(ε), the unconditional risk bound reverts to the slower rate the paper claims to improve upon.

Authors: We thank the referee for this observation. The failure probability of the channel-selection step is exp(−Ω(1/ε)) as established in §3.2. Consequently, the contribution of any failure event to the unconditional risk is at most exp(−Ω(1/ε)) ⋅ O(1) = o(ε). Adding this term to the conditional risk O(d^{ℓ₀}/n) = O(ε) does not change the leading Θ(d^{ℓ₀}/n) rate. We will revise the risk decomposition in §4.3 to state the unconditional bound explicitly, thereby confirming that the claimed minimax-optimal rate holds with high probability. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with no reduction to inputs by construction

full rationale

The paper presents a two-stage procedure consisting of a learnable channel selection algorithm that identifies ell_0 from L channels, followed by GD on the second layer. The claimed sample complexity n ~ Theta(d^{ell_0}/eps) and the resulting nonparametric risk bound Theta(d^{ell_0}/n) are stated as consequences of this procedure being provable at the given n, with the minimax optimality comparison drawn externally to kernel methods of rank Theta(d^{ell_0}). No equation or step reduces the final rate to a fitted parameter, a self-defined quantity, or a self-citation chain by construction; the selection success is asserted as an internal proof obligation rather than an assumption that tautologically encodes the target bound. The derivation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of a provable learnable channel-selection algorithm that correctly identifies ell_0 from L channels and on standard convergence assumptions for gradient descent on the selected channels; no explicit free parameters or new invented entities are named in the abstract.

axioms (1)

domain assumption Gradient descent on the second-layer weights after channel selection converges to the minimax rate under the stated over-parameterization
Invoked in the description of the two-stage training procedure

invented entities (1)

learnable channel attention mechanism no independent evidence
purpose: To perform harmonic-degree selection that identifies the ground-truth degree ell_0
Introduced as the key architectural component enabling the improved sample complexity

pith-pipeline@v0.9.0 · 5673 in / 1450 out tokens · 18943 ms · 2026-05-16T19:59:49.589639+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

minimax optimal rate for the regression risk with a kernel of rank Θ(d^{ℓ₀}) is Θ(d^{ℓ₀}/n)
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

activation function σ_τ(x,x') = Σ_ℓ τ_ℓ P^{(d)}_ℓ(⟨x,x'⟩)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

[1]

Deep learning,

Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, pp. 436–444, 2015

work page 2015
[2]

Gradient descent provably optimizes over- parameterized neural networks,

S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over- parameterized neural networks,” inInternational Conference on Learning Representations, 2019

work page 2019
[3]

A convergence theory for deep learning via over- parameterization,

Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over- parameterization,” inInternational Conference on Machine Learning, ser. Proceedings of Ma- chine Learning Research, vol. 97. PMLR, 2019, pp. 242–252

work page 2019
[4]

Gradient descent finds global minima of deep neural networks,

S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 1675–1685

work page 2019
[5]

Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,

S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang, “Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 322–332

work page 2019
[6]

An improved analysis of training over-parameterized deep neural net- works,

D. Zou and Q. Gu, “An improved analysis of training over-parameterized deep neural net- works,” inAdvances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 2053–2062

work page 2019
[7]

On learning over-parameterized neural networks: A functional approxima- tion perspective,

L. Su and P. Yang, “On learning over-parameterized neural networks: A functional approxima- tion perspective,” inAdvances in Neural Information Processing Systems, 2019, pp. 2637–2646

work page 2019
[8]

Neural tangent kernel: Convergence and generalization in neural networks,

A. Jacot, C. Hongler, and F. Gabriel, “Neural tangent kernel: Convergence and generalization in neural networks,” inAdvances in Neural Information Processing Systems, S. Bengio, H. M. 40 Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 8580– 8589

work page 2018
[9]

Tensor programs IV: feature learning in infinite-width neural networks,

G. Yang and E. J. Hu, “Tensor programs IV: feature learning in infinite-width neural networks,” inProceedings of the 38th International Conference on Machine Learning, ICML 2021, 18- 24 July 2021, Virtual Event, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 2021, pp. 11727–11737

work page 2021
[10]

Generalization bounds of stochastic gradient descent for wide and deep neural networks,

Y. Cao and Q. Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” inAdvances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 10835–10845

work page 2019
[11]

Linearizedtwo-layersneuralnetworks in high dimension,

B.Ghorbani, S.Mei, T.Misiakiewicz, andA.Montanari, “Linearizedtwo-layersneuralnetworks in high dimension,”Ann. Statist., vol. 49, no. 2, pp. 1029 – 1054, 2021

work page 2021
[12]

On the spectral bias of neural networks,

N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun 2019, pp. 5301–5310

work page 2019
[13]

Towards understanding the spectral bias of deep learning,

Y. Cao, Z. Fang, Y. Wu, D. Zhou, and Q. Gu, “Towards understanding the spectral bias of deep learning,” inInternational Joint Conference on Artificial Intelligence, Z. Zhou, Ed. ijcai.org, 2021, pp. 2205–2211

work page 2021
[14]

The spectral biasof polynomial neural networks,

M.Choraria, L.T. Dadi, G.Chrysos, J.Mairal, andV. Cevher, “The spectral biasof polynomial neural networks,” inInternational Conference on Learning Representations. OpenReview.net, 2022

work page 2022
[15]

Beyond linearization: On quadratic and higher-order approximation of wide neural networks,

Y. Bai and J. D. Lee, “Beyond linearization: On quadratic and higher-order approximation of wide neural networks,” inInternational Conference on Learning Representations. OpenRe- view.net, 2020

work page 2020
[16]

Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials,

E. Nichani, Y. Bai, and J. D. Lee, “Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mo- ha...

work page 2022
[17]

Neural networks can learn representations with gradient descent,

A. Damian, J. D. Lee, and M. Soltanolkotabi, “Neural networks can learn representations with gradient descent,” inConference on Learning Theory, 2-5 July 2022, London, UK, ser. Proceedings of Machine Learning Research, P. Loh and M. Raginsky, Eds., vol. 178. PMLR, 2022, pp. 5413–5452

work page 2022
[18]

Mean-field analysis on two-layer neural networks from a kernel per- spective,

S. Takakura and T. Suzuki, “Mean-field analysis on two-layer neural networks from a kernel per- spective,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

work page 2024
[19]

Regularization matters: A nonparametric perspective on overparametrized neural network,

T. Hu, W. Wang, C. Lin, and G. Cheng, “Regularization matters: A nonparametric perspective on overparametrized neural network,” inInternational Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Banerjee and K. Fukumizu, Eds., vol. 130. PMLR, 2021, pp. 829–837. 41

work page 2021
[20]

A non-parametric regression viewpoint : Generalization of over- parametrized deep RELU network under noisy observations,

N. Suh, H. Ko, and X. Huo, “A non-parametric regression viewpoint : Generalization of over- parametrized deep RELU network under noisy observations,” inThe Tenth International Con- ference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenRe- view.net, 2022

work page 2022
[21]

On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains,

Y. Li, Z. Yu, G. Chen, and Q. Lin, “On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains,”Journal of Machine Learning Research, vol. 25, no. 82, pp. 1–47, 2024

work page 2024
[22]

Gradient descent finds over-parameterized neural networks with sharp generalization for nonparametric regression,

Y. Yang and P. Li, “Gradient descent finds over-parameterized neural networks with sharp generalization for nonparametric regression,”arXiv preprint arXiv:2411.02904, 2024. [Online]. Available: https://arxiv.org/abs/2411.02904

work page arXiv 2024
[23]

Sharp generalization for nonparametric regression by over-parameterized neural net- works: A distribution-free analysis in spherical covariate,

Y. Yang, “Sharp generalization for nonparametric regression by over-parameterized neural net- works: A distribution-free analysis in spherical covariate,” inInternational Conference on Ma- chine Learning (ICML), 2025

work page 2025
[24]

Minimax-optimal rates for sparse additive models over kernel classes via convex programming,

G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax-optimal rates for sparse additive models over kernel classes via convex programming,”J. Mach. Learn. Res., vol. 13, pp. 389–427, 2012

work page 2012
[25]

Information theoretic bounds for compressed sensing,

S. Aeron, V. Saligrama, and M. Zhao, “Information theoretic bounds for compressed sensing,” IEEE Transactions on Information Theory, vol. 56, no. 10, pp. 5111–5130, 2010

work page 2010
[26]

Additive Regression and Other Nonparametric Models,

C. J. Stone, “Additive Regression and Other Nonparametric Models,”Ann. Statist., vol. 13, no. 2, pp. 689 – 705, 1985

work page 1985
[27]

Information-theoreticdeterminationofminimaxratesofconvergence,

Y.YangandA.Barron, “Information-theoreticdeterminationofminimaxratesofconvergence,” Ann. Statist., vol. 27, no. 5, pp. 1564 – 1599, 1999

work page 1999
[28]

Early stopping and non-parametric regression: an optimal data-dependent stopping rule,

G. Raskutti, M. J. Wainwright, and B. Yu, “Early stopping and non-parametric regression: an optimal data-dependent stopping rule,”J. Mach. Learn. Res., vol. 15, no. 1, pp. 335–366, 2014

work page 2014
[29]

Minimax optimal rates of estimation in high dimensional additive models,

M. Yuan and D.-X. Zhou, “Minimax optimal rates of estimation in high dimensional additive models,”Ann. Statist., vol. 44, no. 6, pp. 2564 – 2593, 2016

work page 2016
[30]

Local rademacher complexities,

P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,”Ann. Statist., vol. 33, no. 4, pp. 1497–1537, 08 2005

work page 2005
[31]

Local rademacher complexities and oracle inequalities in risk minimization,

V. Koltchinskii, “Local rademacher complexities and oracle inequalities in risk minimization,” Ann. Statist., vol. 34, no. 6, pp. 2593–2656, 12 2006

work page 2006
[32]

Geometric parameters of kernel machines,

S. Mendelson, “Geometric parameters of kernel machines,” inConference on Computational Learning Theory, ser. Lecture Notes in Computer Science, J. Kivinen and R. H. Sloan, Eds., vol. 2375. Springer, 2002, pp. 29–43

work page 2002
[33]

Boston, MA: Birkhäuser Boston, 1992, pp

I.Pinelis,An Approach to Inequalities for the Distributions of Infinite-Dimensional Martingales. Boston, MA: Birkhäuser Boston, 1992, pp. 128–134

work page 1992
[34]

Ledoux,Probability in Banach Spaces [electronic resource] : Isoperimetry and Processes / by Michel Ledoux, Michel Talagrand., 1st ed., ser

M. Ledoux,Probability in Banach Spaces [electronic resource] : Isoperimetry and Processes / by Michel Ledoux, Michel Talagrand., 1st ed., ser. Classics in Mathematics. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991

work page 1991
[35]

Chihara,An Introduction to Orthogonal Polynomials, ser

T. Chihara,An Introduction to Orthogonal Polynomials, ser. Dover Books on Mathematics. Dover Publications, 2011. 42

work page 2011
[36]

Efthimiou and C

C. Efthimiou and C. Frye,Spherical Harmonics in p Dimensions. World Scientific Co., 2014

work page 2014
[37]

Szegő,Orthogonal Polynomials, ser

G. Szegő,Orthogonal Polynomials, ser. American Math. Soc: Colloquium publ. Amer. Math. Soc., 1975

work page 1975
[38]

Breaking the curse of dimensionality with convex neural networks,

F. R. Bach, “Breaking the curse of dimensionality with convex neural networks,”J. Mach. Learn. Res., vol. 18, pp. 19:1–19:53, 2017

work page 2017
[39]

On the inductive bias of neural tangent kernels,

A. Bietti and J. Mairal, “On the inductive bias of neural tangent kernels,” inAdvances in Neural Information Processing Systems, H.M.Wallach, H.Larochelle, A.Beygelzimer, F.d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 12873–12884

work page 2019
[40]

Basics of harmonic polynomials and spherical functions,

N. V. Krylov, “Basics of harmonic polynomials and spherical functions,” Tech. Rep. [Online]. Available: https://www-users.cse.umn.edu/~nkrylov/Moscow_2019_Sphrcal.pdf

work page
[41]

A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions are not Necessarily Symmetric,

F. T. Wright, “A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions are not Necessarily Symmetric,”Ann. Probab., vol. 1, no. 6, pp. 1068 – 1070, 1973

work page 1973
[42]

An introduction to matrix concentration inequalities,

J. A. Tropp, “An introduction to matrix concentration inequalities,”Foundations and Trends® in Machine Learning, vol. 8, no. 1-2, pp. 1–230, 2015

work page 2015
[43]

Dual attention network for scene segmentation,

J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” inIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 3146–3154

work page 2019
[44]

Eca-net: Efficient channel attention for deep convolutional neural networks,

Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2020, pp. 11531–11539

work page 2020
[45]

Xcit: Cross-covariance image transformers,

A. Ali, H. Touvron, M. Caron, P. Bojanowski, M. Douze, A. Joulin, I. Laptev, N. Neverova, G. Synnaeve, J. Verbeek, and H. Jégou, “Xcit: Cross-covariance image transformers,” inAd- vances in Neural Information Processing Systems 34: Annual Conference on Neural Infor- mation Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A....

work page 2021
[46]

Understanding matrix function normalizations incovariancepoolingthroughthelensofriemanniangeometry,

Z. Chen, Y. Song, X. Wu, G. Liu, and N. Sebe, “Understanding matrix function normalizations incovariancepoolingthroughthelensofriemanniangeometry,” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenRe- view.net, 2025

work page 2025
[47]

Why approximate matrix square root outperforms accurate SVD in global covariance pooling?

Y. Song, N. Sebe, and W. Wang, “Why approximate matrix square root outperforms accurate SVD in global covariance pooling?” in2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 2021, pp. 1095–1103

work page 2021
[48]

Towards a deeper understanding of global covariance pooling in deep learning: An optimization perspective,

Q. Wang, Z. Zhang, M. Gao, J. Xie, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Towards a deeper understanding of global covariance pooling in deep learning: An optimization perspective,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 12, pp. 15802–15819, 2023. 43

work page 2023
[49]

Rethinking attention with performers,

K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller, “Rethinking attention with performers,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

work page 2021
[50]

Random feature attention,

H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, “Random feature attention,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

work page 2021
[51]

Efficient attention via control variates,

L. Zheng, J. Yuan, C. Wang, and L. Kong, “Efficient attention via control variates,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

work page 2023
[52]

Infinite attention: NNGP and NTK for deep attention networks,

J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak, “Infinite attention: NNGP and NTK for deep attention networks,” inProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 2020, pp. 4376–4386

work page 2020
[53]

Transformers are minimax optimal nonparametric in- context learners,

J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in- context learners,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 44

work page 2024

[1] [1]

Deep learning,

Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, pp. 436–444, 2015

work page 2015

[2] [2]

Gradient descent provably optimizes over- parameterized neural networks,

S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over- parameterized neural networks,” inInternational Conference on Learning Representations, 2019

work page 2019

[3] [3]

A convergence theory for deep learning via over- parameterization,

Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over- parameterization,” inInternational Conference on Machine Learning, ser. Proceedings of Ma- chine Learning Research, vol. 97. PMLR, 2019, pp. 242–252

work page 2019

[4] [4]

Gradient descent finds global minima of deep neural networks,

S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 1675–1685

work page 2019

[5] [5]

Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,

S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang, “Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 322–332

work page 2019

[6] [6]

An improved analysis of training over-parameterized deep neural net- works,

D. Zou and Q. Gu, “An improved analysis of training over-parameterized deep neural net- works,” inAdvances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 2053–2062

work page 2019

[7] [7]

On learning over-parameterized neural networks: A functional approxima- tion perspective,

L. Su and P. Yang, “On learning over-parameterized neural networks: A functional approxima- tion perspective,” inAdvances in Neural Information Processing Systems, 2019, pp. 2637–2646

work page 2019

[8] [8]

Neural tangent kernel: Convergence and generalization in neural networks,

A. Jacot, C. Hongler, and F. Gabriel, “Neural tangent kernel: Convergence and generalization in neural networks,” inAdvances in Neural Information Processing Systems, S. Bengio, H. M. 40 Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 8580– 8589

work page 2018

[9] [9]

Tensor programs IV: feature learning in infinite-width neural networks,

G. Yang and E. J. Hu, “Tensor programs IV: feature learning in infinite-width neural networks,” inProceedings of the 38th International Conference on Machine Learning, ICML 2021, 18- 24 July 2021, Virtual Event, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 2021, pp. 11727–11737

work page 2021

[10] [10]

Generalization bounds of stochastic gradient descent for wide and deep neural networks,

Y. Cao and Q. Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” inAdvances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 10835–10845

work page 2019

[11] [11]

Linearizedtwo-layersneuralnetworks in high dimension,

B.Ghorbani, S.Mei, T.Misiakiewicz, andA.Montanari, “Linearizedtwo-layersneuralnetworks in high dimension,”Ann. Statist., vol. 49, no. 2, pp. 1029 – 1054, 2021

work page 2021

[12] [12]

On the spectral bias of neural networks,

N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun 2019, pp. 5301–5310

work page 2019

[13] [13]

Towards understanding the spectral bias of deep learning,

Y. Cao, Z. Fang, Y. Wu, D. Zhou, and Q. Gu, “Towards understanding the spectral bias of deep learning,” inInternational Joint Conference on Artificial Intelligence, Z. Zhou, Ed. ijcai.org, 2021, pp. 2205–2211

work page 2021

[14] [14]

The spectral biasof polynomial neural networks,

M.Choraria, L.T. Dadi, G.Chrysos, J.Mairal, andV. Cevher, “The spectral biasof polynomial neural networks,” inInternational Conference on Learning Representations. OpenReview.net, 2022

work page 2022

[15] [15]

Beyond linearization: On quadratic and higher-order approximation of wide neural networks,

Y. Bai and J. D. Lee, “Beyond linearization: On quadratic and higher-order approximation of wide neural networks,” inInternational Conference on Learning Representations. OpenRe- view.net, 2020

work page 2020

[16] [16]

Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials,

E. Nichani, Y. Bai, and J. D. Lee, “Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mo- ha...

work page 2022

[17] [17]

Neural networks can learn representations with gradient descent,

A. Damian, J. D. Lee, and M. Soltanolkotabi, “Neural networks can learn representations with gradient descent,” inConference on Learning Theory, 2-5 July 2022, London, UK, ser. Proceedings of Machine Learning Research, P. Loh and M. Raginsky, Eds., vol. 178. PMLR, 2022, pp. 5413–5452

work page 2022

[18] [18]

Mean-field analysis on two-layer neural networks from a kernel per- spective,

S. Takakura and T. Suzuki, “Mean-field analysis on two-layer neural networks from a kernel per- spective,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

work page 2024

[19] [19]

Regularization matters: A nonparametric perspective on overparametrized neural network,

T. Hu, W. Wang, C. Lin, and G. Cheng, “Regularization matters: A nonparametric perspective on overparametrized neural network,” inInternational Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Banerjee and K. Fukumizu, Eds., vol. 130. PMLR, 2021, pp. 829–837. 41

work page 2021

[20] [20]

A non-parametric regression viewpoint : Generalization of over- parametrized deep RELU network under noisy observations,

N. Suh, H. Ko, and X. Huo, “A non-parametric regression viewpoint : Generalization of over- parametrized deep RELU network under noisy observations,” inThe Tenth International Con- ference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenRe- view.net, 2022

work page 2022

[21] [21]

On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains,

Y. Li, Z. Yu, G. Chen, and Q. Lin, “On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains,”Journal of Machine Learning Research, vol. 25, no. 82, pp. 1–47, 2024

work page 2024

[22] [22]

Gradient descent finds over-parameterized neural networks with sharp generalization for nonparametric regression,

Y. Yang and P. Li, “Gradient descent finds over-parameterized neural networks with sharp generalization for nonparametric regression,”arXiv preprint arXiv:2411.02904, 2024. [Online]. Available: https://arxiv.org/abs/2411.02904

work page arXiv 2024

[23] [23]

Sharp generalization for nonparametric regression by over-parameterized neural net- works: A distribution-free analysis in spherical covariate,

Y. Yang, “Sharp generalization for nonparametric regression by over-parameterized neural net- works: A distribution-free analysis in spherical covariate,” inInternational Conference on Ma- chine Learning (ICML), 2025

work page 2025

[24] [24]

Minimax-optimal rates for sparse additive models over kernel classes via convex programming,

G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax-optimal rates for sparse additive models over kernel classes via convex programming,”J. Mach. Learn. Res., vol. 13, pp. 389–427, 2012

work page 2012

[25] [25]

Information theoretic bounds for compressed sensing,

S. Aeron, V. Saligrama, and M. Zhao, “Information theoretic bounds for compressed sensing,” IEEE Transactions on Information Theory, vol. 56, no. 10, pp. 5111–5130, 2010

work page 2010

[26] [26]

Additive Regression and Other Nonparametric Models,

C. J. Stone, “Additive Regression and Other Nonparametric Models,”Ann. Statist., vol. 13, no. 2, pp. 689 – 705, 1985

work page 1985

[27] [27]

Information-theoreticdeterminationofminimaxratesofconvergence,

Y.YangandA.Barron, “Information-theoreticdeterminationofminimaxratesofconvergence,” Ann. Statist., vol. 27, no. 5, pp. 1564 – 1599, 1999

work page 1999

[28] [28]

Early stopping and non-parametric regression: an optimal data-dependent stopping rule,

G. Raskutti, M. J. Wainwright, and B. Yu, “Early stopping and non-parametric regression: an optimal data-dependent stopping rule,”J. Mach. Learn. Res., vol. 15, no. 1, pp. 335–366, 2014

work page 2014

[29] [29]

Minimax optimal rates of estimation in high dimensional additive models,

M. Yuan and D.-X. Zhou, “Minimax optimal rates of estimation in high dimensional additive models,”Ann. Statist., vol. 44, no. 6, pp. 2564 – 2593, 2016

work page 2016

[30] [30]

Local rademacher complexities,

P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,”Ann. Statist., vol. 33, no. 4, pp. 1497–1537, 08 2005

work page 2005

[31] [31]

Local rademacher complexities and oracle inequalities in risk minimization,

V. Koltchinskii, “Local rademacher complexities and oracle inequalities in risk minimization,” Ann. Statist., vol. 34, no. 6, pp. 2593–2656, 12 2006

work page 2006

[32] [32]

Geometric parameters of kernel machines,

S. Mendelson, “Geometric parameters of kernel machines,” inConference on Computational Learning Theory, ser. Lecture Notes in Computer Science, J. Kivinen and R. H. Sloan, Eds., vol. 2375. Springer, 2002, pp. 29–43

work page 2002

[33] [33]

Boston, MA: Birkhäuser Boston, 1992, pp

I.Pinelis,An Approach to Inequalities for the Distributions of Infinite-Dimensional Martingales. Boston, MA: Birkhäuser Boston, 1992, pp. 128–134

work page 1992

[34] [34]

Ledoux,Probability in Banach Spaces [electronic resource] : Isoperimetry and Processes / by Michel Ledoux, Michel Talagrand., 1st ed., ser

M. Ledoux,Probability in Banach Spaces [electronic resource] : Isoperimetry and Processes / by Michel Ledoux, Michel Talagrand., 1st ed., ser. Classics in Mathematics. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991

work page 1991

[35] [35]

Chihara,An Introduction to Orthogonal Polynomials, ser

T. Chihara,An Introduction to Orthogonal Polynomials, ser. Dover Books on Mathematics. Dover Publications, 2011. 42

work page 2011

[36] [36]

Efthimiou and C

C. Efthimiou and C. Frye,Spherical Harmonics in p Dimensions. World Scientific Co., 2014

work page 2014

[37] [37]

Szegő,Orthogonal Polynomials, ser

G. Szegő,Orthogonal Polynomials, ser. American Math. Soc: Colloquium publ. Amer. Math. Soc., 1975

work page 1975

[38] [38]

Breaking the curse of dimensionality with convex neural networks,

F. R. Bach, “Breaking the curse of dimensionality with convex neural networks,”J. Mach. Learn. Res., vol. 18, pp. 19:1–19:53, 2017

work page 2017

[39] [39]

On the inductive bias of neural tangent kernels,

A. Bietti and J. Mairal, “On the inductive bias of neural tangent kernels,” inAdvances in Neural Information Processing Systems, H.M.Wallach, H.Larochelle, A.Beygelzimer, F.d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 12873–12884

work page 2019

[40] [40]

Basics of harmonic polynomials and spherical functions,

N. V. Krylov, “Basics of harmonic polynomials and spherical functions,” Tech. Rep. [Online]. Available: https://www-users.cse.umn.edu/~nkrylov/Moscow_2019_Sphrcal.pdf

work page

[41] [41]

A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions are not Necessarily Symmetric,

F. T. Wright, “A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions are not Necessarily Symmetric,”Ann. Probab., vol. 1, no. 6, pp. 1068 – 1070, 1973

work page 1973

[42] [42]

An introduction to matrix concentration inequalities,

J. A. Tropp, “An introduction to matrix concentration inequalities,”Foundations and Trends® in Machine Learning, vol. 8, no. 1-2, pp. 1–230, 2015

work page 2015

[43] [43]

Dual attention network for scene segmentation,

J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” inIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 3146–3154

work page 2019

[44] [44]

Eca-net: Efficient channel attention for deep convolutional neural networks,

Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2020, pp. 11531–11539

work page 2020

[45] [45]

Xcit: Cross-covariance image transformers,

A. Ali, H. Touvron, M. Caron, P. Bojanowski, M. Douze, A. Joulin, I. Laptev, N. Neverova, G. Synnaeve, J. Verbeek, and H. Jégou, “Xcit: Cross-covariance image transformers,” inAd- vances in Neural Information Processing Systems 34: Annual Conference on Neural Infor- mation Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A....

work page 2021

[46] [46]

Understanding matrix function normalizations incovariancepoolingthroughthelensofriemanniangeometry,

Z. Chen, Y. Song, X. Wu, G. Liu, and N. Sebe, “Understanding matrix function normalizations incovariancepoolingthroughthelensofriemanniangeometry,” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenRe- view.net, 2025

work page 2025

[47] [47]

Why approximate matrix square root outperforms accurate SVD in global covariance pooling?

Y. Song, N. Sebe, and W. Wang, “Why approximate matrix square root outperforms accurate SVD in global covariance pooling?” in2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 2021, pp. 1095–1103

work page 2021

[48] [48]

Towards a deeper understanding of global covariance pooling in deep learning: An optimization perspective,

Q. Wang, Z. Zhang, M. Gao, J. Xie, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Towards a deeper understanding of global covariance pooling in deep learning: An optimization perspective,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 12, pp. 15802–15819, 2023. 43

work page 2023

[49] [49]

Rethinking attention with performers,

K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller, “Rethinking attention with performers,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

work page 2021

[50] [50]

Random feature attention,

H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, “Random feature attention,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

work page 2021

[51] [51]

Efficient attention via control variates,

L. Zheng, J. Yuan, C. Wang, and L. Kong, “Efficient attention via control variates,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

work page 2023

[52] [52]

Infinite attention: NNGP and NTK for deep attention networks,

J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak, “Infinite attention: NNGP and NTK for deep attention networks,” inProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 2020, pp. 4376–4386

work page 2020

[53] [53]

Transformers are minimax optimal nonparametric in- context learners,

J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in- context learners,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 44

work page 2024