Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention
Pith reviewed 2026-05-16 19:59 UTC · model grok-4.3
The pith
Two-layer neural networks with learnable channel attention achieve the minimax optimal sample complexity n ~ d to the ell zero over epsilon for low-degree spherical polynomials.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For any regression risk epsilon in (0,1), a two-layer NN with channel attention and finite width trained by vanilla GD requires sample complexity n as Theta of d to the ell zero over epsilon with high probability and renders nonparametric regression risk of order Theta of d to the ell zero over n that is minimax optimal, via a two-stage process in which a provable learnable channel selection algorithm identifies the ground-truth degree ell zero from L greater than or equal to ell zero channels before standard GD trains the second layer.
What carries the argument
The learnable channel selection algorithm, a harmonic-degree selection process that identifies the ground-truth channel number ell zero from the first-layer activation channels before second-layer training.
If this is right
- The achieved sample complexity n as Theta of d to the ell zero over epsilon is the lowest possible with high probability.
- The regression risk bound of Theta of d to the ell zero over n is sharp and matches the minimax lower bound.
- The result holds for finite-width networks trained with vanilla gradient descent.
- Feature learning through the channel attention step is what removes the extra log d and epsilon to the minus two factors present in prior bounds.
Where Pith is reading between the lines
- Similar explicit selection mechanisms may allow overparameterized networks to reach optimality in other high-dimensional nonparametric regression tasks.
- The separation into a selection stage followed by gradient descent suggests that pure end-to-end training without selection may not suffice for minimax rates on polynomial classes.
- The approach could be tested on related function classes such as low-degree polynomials on the hypercube or other compact manifolds.
Load-bearing premise
The network must contain a provable learnable channel selection algorithm that correctly identifies the ground-truth degree ell zero from the available channels before the second stage of gradient descent.
What would settle it
An experiment showing that removing the channel selection step forces the sample complexity back to Theta of d to the ell zero times max of epsilon to the minus two or log d, or produces risk strictly worse than Theta of d to the ell zero over n.
read the original abstract
We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\RR^d$ by training an over-parameterized two-layer neural network (NN) with channel attention in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\eps \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width trained by the vanilla gradient descent (GD) requires the lowest sample complexity of $n \asymp \Theta(d^{\ell_0}/\eps)$ with high probability, in contrast with the representative sample complexity $\Theta\pth{d^{\ell_0} \max\set{\eps^{-2},\log d}}$, where $n$ is the training data size. Moreover, such sample complexity is not improvable since the trained network renders a sharp rate of the nonparametric regression risk of the order $\Theta(d^{\ell_0}/{n})$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GD is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provable learnable channel selection algorithm, as a learnable harmonic-degree selection process, identifies the ground truth channel number in the target function, $\ell_0$, from $L \ge \ell_0$ channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound is obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies learning a low-degree spherical polynomial of fixed degree ℓ₀ = Θ(1) on the unit sphere in ℝ^d via an over-parameterized two-layer neural network equipped with a learnable channel attention mechanism. It proposes a two-stage procedure: (1) a provable channel-selection step that identifies the ground-truth degree ℓ₀ from L ≥ ℓ₀ channels, followed by (2) vanilla gradient descent on the second-layer weights using only the selected channels. The central claim is that this procedure achieves sample complexity n ≍ Θ(d^{ℓ₀}/ε) for any regression risk ε ∈ (0,1) with high probability, yielding a sharp nonparametric risk of Θ(d^{ℓ₀}/n) that matches the minimax rate for a kernel of rank Θ(d^{ℓ₀}).
Significance. If the proofs are complete, the result would be a meaningful advance in the theory of feature learning for shallow networks: it supplies the first finite-width, vanilla-GD guarantee that attains the minimax rate for this function class, explicitly separating the benefit of learnable channel attention from kernel methods that incur extra max(ε^{-2}, log d) factors. The explicit two-stage construction and the matching lower bound are strengths that could influence subsequent work on attention-based feature selection.
major comments (2)
- [§3.2 and Theorem 4.1] §3.2 (Channel Selection Analysis) and Theorem 4.1: The headline sample-complexity claim n ≍ Θ(d^{ℓ₀}/ε) requires that the learnable channel-selection step succeeds with probability 1−o(1) at exactly this n. The concentration argument for the attention weights (Eq. (8) and the subsequent union bound over L channels) appears to produce a failure probability that decays only after an extra Ω(log(d/δ)) factor in n; setting δ = Θ(ε) to control the final risk therefore re-introduces a logarithmic term that the abstract contrasts against the prior Θ(d^{ℓ₀} max{ε^{-2}, log d}) bound.
- [§4.3] §4.3 (Risk Decomposition): The sharp rate Θ(d^{ℓ₀}/n) for the trained network is obtained by showing that, conditional on correct channel selection, the second-stage GD error is O(d^{ℓ₀}/n). However, the decomposition does not quantify the probability that selection fails on a constant fraction of draws; if this probability is Ω(ε), the unconditional risk bound reverts to the slower rate the paper claims to improve upon.
minor comments (2)
- [§2] §2 (Notation): The definition of the spherical-harmonic basis and the precise parameterization of the channel attention weights (Eq. (3)) would benefit from an explicit statement of the normalization constants and the range of the learnable scalars.
- [Figure 1 and Algorithm 1] Figure 1 and Algorithm 1: The schematic of the two-stage procedure is clear, but the caption should state the precise dependence of the width m and the number of channels L on d and ℓ₀ that is used in the theorems.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below, providing clarifications on the concentration and risk bounds. We will incorporate explicit remarks into the revised manuscript.
read point-by-point responses
-
Referee: [§3.2 and Theorem 4.1] §3.2 (Channel Selection Analysis) and Theorem 4.1: The headline sample-complexity claim n ≍ Θ(d^{ℓ₀}/ε) requires that the learnable channel-selection step succeeds with probability 1−o(1) at exactly this n. The concentration argument for the attention weights (Eq. (8) and the subsequent union bound over L channels) appears to produce a failure probability that decays only after an extra Ω(log(d/δ)) factor in n; setting δ = Θ(ε) to control the final risk therefore re-introduces a logarithmic term that the abstract contrasts against the prior Θ(d^{ℓ₀} max{ε^{-2}, log d}) bound.
Authors: We appreciate the referee highlighting this detail. The concentration inequality for the attention weights (Eq. (8)) is sub-Gaussian and yields a per-channel failure probability of exp(−Ω(n / d^{ℓ₀})). Substituting the claimed n ≍ Θ(d^{ℓ₀}/ε) produces failure probability exp(−Ω(1/ε)), which is o(ε^k) for any fixed k. Because ℓ₀ = Θ(1) we may take L = O(1) (or at most polylog(d) without changing the leading term), so the union bound over L channels remains exponentially small in 1/ε and does not require any extra logarithmic factor in n. We will add a short paragraph after Eq. (8) making this exponential decay explicit and confirming that the sample complexity remains Θ(d^{ℓ₀}/ε) with probability 1−o(1). revision: partial
-
Referee: [§4.3] §4.3 (Risk Decomposition): The sharp rate Θ(d^{ℓ₀}/n) for the trained network is obtained by showing that, conditional on correct channel selection, the second-stage GD error is O(d^{ℓ₀}/n). However, the decomposition does not quantify the probability that selection fails on a constant fraction of draws; if this probability is Ω(ε), the unconditional risk bound reverts to the slower rate the paper claims to improve upon.
Authors: We thank the referee for this observation. The failure probability of the channel-selection step is exp(−Ω(1/ε)) as established in §3.2. Consequently, the contribution of any failure event to the unconditional risk is at most exp(−Ω(1/ε)) ⋅ O(1) = o(ε). Adding this term to the conditional risk O(d^{ℓ₀}/n) = O(ε) does not change the leading Θ(d^{ℓ₀}/n) rate. We will revise the risk decomposition in §4.3 to state the unconditional bound explicitly, thereby confirming that the claimed minimax-optimal rate holds with high probability. revision: yes
Circularity Check
Derivation chain is self-contained with no reduction to inputs by construction
full rationale
The paper presents a two-stage procedure consisting of a learnable channel selection algorithm that identifies ell_0 from L channels, followed by GD on the second layer. The claimed sample complexity n ~ Theta(d^{ell_0}/eps) and the resulting nonparametric risk bound Theta(d^{ell_0}/n) are stated as consequences of this procedure being provable at the given n, with the minimax optimality comparison drawn externally to kernel methods of rank Theta(d^{ell_0}). No equation or step reduces the final rate to a fitted parameter, a self-defined quantity, or a self-citation chain by construction; the selection success is asserted as an internal proof obligation rather than an assumption that tautologically encodes the target bound. The derivation therefore remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Gradient descent on the second-layer weights after channel selection converges to the minimax rate under the stated over-parameterization
invented entities (1)
-
learnable channel attention mechanism
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
minimax optimal rate for the regression risk with a kernel of rank Θ(d^{ℓ₀}) is Θ(d^{ℓ₀}/n)
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
activation function σ_τ(x,x') = Σ_ℓ τ_ℓ P^{(d)}_ℓ(⟨x,x'⟩)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, pp. 436–444, 2015
work page 2015
-
[2]
Gradient descent provably optimizes over- parameterized neural networks,
S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over- parameterized neural networks,” inInternational Conference on Learning Representations, 2019
work page 2019
-
[3]
A convergence theory for deep learning via over- parameterization,
Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over- parameterization,” inInternational Conference on Machine Learning, ser. Proceedings of Ma- chine Learning Research, vol. 97. PMLR, 2019, pp. 242–252
work page 2019
-
[4]
Gradient descent finds global minima of deep neural networks,
S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 1675–1685
work page 2019
-
[5]
S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang, “Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019, pp. 322–332
work page 2019
-
[6]
An improved analysis of training over-parameterized deep neural net- works,
D. Zou and Q. Gu, “An improved analysis of training over-parameterized deep neural net- works,” inAdvances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 2053–2062
work page 2019
-
[7]
On learning over-parameterized neural networks: A functional approxima- tion perspective,
L. Su and P. Yang, “On learning over-parameterized neural networks: A functional approxima- tion perspective,” inAdvances in Neural Information Processing Systems, 2019, pp. 2637–2646
work page 2019
-
[8]
Neural tangent kernel: Convergence and generalization in neural networks,
A. Jacot, C. Hongler, and F. Gabriel, “Neural tangent kernel: Convergence and generalization in neural networks,” inAdvances in Neural Information Processing Systems, S. Bengio, H. M. 40 Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 8580– 8589
work page 2018
-
[9]
Tensor programs IV: feature learning in infinite-width neural networks,
G. Yang and E. J. Hu, “Tensor programs IV: feature learning in infinite-width neural networks,” inProceedings of the 38th International Conference on Machine Learning, ICML 2021, 18- 24 July 2021, Virtual Event, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 2021, pp. 11727–11737
work page 2021
-
[10]
Generalization bounds of stochastic gradient descent for wide and deep neural networks,
Y. Cao and Q. Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” inAdvances in Neural Information Processing Systems, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 10835–10845
work page 2019
-
[11]
Linearizedtwo-layersneuralnetworks in high dimension,
B.Ghorbani, S.Mei, T.Misiakiewicz, andA.Montanari, “Linearizedtwo-layersneuralnetworks in high dimension,”Ann. Statist., vol. 49, no. 2, pp. 1029 – 1054, 2021
work page 2021
-
[12]
On the spectral bias of neural networks,
N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,” inInternational Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun 2019, pp. 5301–5310
work page 2019
-
[13]
Towards understanding the spectral bias of deep learning,
Y. Cao, Z. Fang, Y. Wu, D. Zhou, and Q. Gu, “Towards understanding the spectral bias of deep learning,” inInternational Joint Conference on Artificial Intelligence, Z. Zhou, Ed. ijcai.org, 2021, pp. 2205–2211
work page 2021
-
[14]
The spectral biasof polynomial neural networks,
M.Choraria, L.T. Dadi, G.Chrysos, J.Mairal, andV. Cevher, “The spectral biasof polynomial neural networks,” inInternational Conference on Learning Representations. OpenReview.net, 2022
work page 2022
-
[15]
Beyond linearization: On quadratic and higher-order approximation of wide neural networks,
Y. Bai and J. D. Lee, “Beyond linearization: On quadratic and higher-order approximation of wide neural networks,” inInternational Conference on Learning Representations. OpenRe- view.net, 2020
work page 2020
-
[16]
E. Nichani, Y. Bai, and J. D. Lee, “Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mo- ha...
work page 2022
-
[17]
Neural networks can learn representations with gradient descent,
A. Damian, J. D. Lee, and M. Soltanolkotabi, “Neural networks can learn representations with gradient descent,” inConference on Learning Theory, 2-5 July 2022, London, UK, ser. Proceedings of Machine Learning Research, P. Loh and M. Raginsky, Eds., vol. 178. PMLR, 2022, pp. 5413–5452
work page 2022
-
[18]
Mean-field analysis on two-layer neural networks from a kernel per- spective,
S. Takakura and T. Suzuki, “Mean-field analysis on two-layer neural networks from a kernel per- spective,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024
work page 2024
-
[19]
Regularization matters: A nonparametric perspective on overparametrized neural network,
T. Hu, W. Wang, C. Lin, and G. Cheng, “Regularization matters: A nonparametric perspective on overparametrized neural network,” inInternational Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Banerjee and K. Fukumizu, Eds., vol. 130. PMLR, 2021, pp. 829–837. 41
work page 2021
-
[20]
N. Suh, H. Ko, and X. Huo, “A non-parametric regression viewpoint : Generalization of over- parametrized deep RELU network under noisy observations,” inThe Tenth International Con- ference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenRe- view.net, 2022
work page 2022
-
[21]
Y. Li, Z. Yu, G. Chen, and Q. Lin, “On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains,”Journal of Machine Learning Research, vol. 25, no. 82, pp. 1–47, 2024
work page 2024
-
[22]
Y. Yang and P. Li, “Gradient descent finds over-parameterized neural networks with sharp generalization for nonparametric regression,”arXiv preprint arXiv:2411.02904, 2024. [Online]. Available: https://arxiv.org/abs/2411.02904
-
[23]
Y. Yang, “Sharp generalization for nonparametric regression by over-parameterized neural net- works: A distribution-free analysis in spherical covariate,” inInternational Conference on Ma- chine Learning (ICML), 2025
work page 2025
-
[24]
Minimax-optimal rates for sparse additive models over kernel classes via convex programming,
G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax-optimal rates for sparse additive models over kernel classes via convex programming,”J. Mach. Learn. Res., vol. 13, pp. 389–427, 2012
work page 2012
-
[25]
Information theoretic bounds for compressed sensing,
S. Aeron, V. Saligrama, and M. Zhao, “Information theoretic bounds for compressed sensing,” IEEE Transactions on Information Theory, vol. 56, no. 10, pp. 5111–5130, 2010
work page 2010
-
[26]
Additive Regression and Other Nonparametric Models,
C. J. Stone, “Additive Regression and Other Nonparametric Models,”Ann. Statist., vol. 13, no. 2, pp. 689 – 705, 1985
work page 1985
-
[27]
Information-theoreticdeterminationofminimaxratesofconvergence,
Y.YangandA.Barron, “Information-theoreticdeterminationofminimaxratesofconvergence,” Ann. Statist., vol. 27, no. 5, pp. 1564 – 1599, 1999
work page 1999
-
[28]
Early stopping and non-parametric regression: an optimal data-dependent stopping rule,
G. Raskutti, M. J. Wainwright, and B. Yu, “Early stopping and non-parametric regression: an optimal data-dependent stopping rule,”J. Mach. Learn. Res., vol. 15, no. 1, pp. 335–366, 2014
work page 2014
-
[29]
Minimax optimal rates of estimation in high dimensional additive models,
M. Yuan and D.-X. Zhou, “Minimax optimal rates of estimation in high dimensional additive models,”Ann. Statist., vol. 44, no. 6, pp. 2564 – 2593, 2016
work page 2016
-
[30]
Local rademacher complexities,
P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,”Ann. Statist., vol. 33, no. 4, pp. 1497–1537, 08 2005
work page 2005
-
[31]
Local rademacher complexities and oracle inequalities in risk minimization,
V. Koltchinskii, “Local rademacher complexities and oracle inequalities in risk minimization,” Ann. Statist., vol. 34, no. 6, pp. 2593–2656, 12 2006
work page 2006
-
[32]
Geometric parameters of kernel machines,
S. Mendelson, “Geometric parameters of kernel machines,” inConference on Computational Learning Theory, ser. Lecture Notes in Computer Science, J. Kivinen and R. H. Sloan, Eds., vol. 2375. Springer, 2002, pp. 29–43
work page 2002
-
[33]
Boston, MA: Birkhäuser Boston, 1992, pp
I.Pinelis,An Approach to Inequalities for the Distributions of Infinite-Dimensional Martingales. Boston, MA: Birkhäuser Boston, 1992, pp. 128–134
work page 1992
-
[34]
M. Ledoux,Probability in Banach Spaces [electronic resource] : Isoperimetry and Processes / by Michel Ledoux, Michel Talagrand., 1st ed., ser. Classics in Mathematics. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991
work page 1991
-
[35]
Chihara,An Introduction to Orthogonal Polynomials, ser
T. Chihara,An Introduction to Orthogonal Polynomials, ser. Dover Books on Mathematics. Dover Publications, 2011. 42
work page 2011
-
[36]
C. Efthimiou and C. Frye,Spherical Harmonics in p Dimensions. World Scientific Co., 2014
work page 2014
-
[37]
Szegő,Orthogonal Polynomials, ser
G. Szegő,Orthogonal Polynomials, ser. American Math. Soc: Colloquium publ. Amer. Math. Soc., 1975
work page 1975
-
[38]
Breaking the curse of dimensionality with convex neural networks,
F. R. Bach, “Breaking the curse of dimensionality with convex neural networks,”J. Mach. Learn. Res., vol. 18, pp. 19:1–19:53, 2017
work page 2017
-
[39]
On the inductive bias of neural tangent kernels,
A. Bietti and J. Mairal, “On the inductive bias of neural tangent kernels,” inAdvances in Neural Information Processing Systems, H.M.Wallach, H.Larochelle, A.Beygelzimer, F.d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 12873–12884
work page 2019
-
[40]
Basics of harmonic polynomials and spherical functions,
N. V. Krylov, “Basics of harmonic polynomials and spherical functions,” Tech. Rep. [Online]. Available: https://www-users.cse.umn.edu/~nkrylov/Moscow_2019_Sphrcal.pdf
-
[41]
F. T. Wright, “A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions are not Necessarily Symmetric,”Ann. Probab., vol. 1, no. 6, pp. 1068 – 1070, 1973
work page 1973
-
[42]
An introduction to matrix concentration inequalities,
J. A. Tropp, “An introduction to matrix concentration inequalities,”Foundations and Trends® in Machine Learning, vol. 8, no. 1-2, pp. 1–230, 2015
work page 2015
-
[43]
Dual attention network for scene segmentation,
J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” inIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 3146–3154
work page 2019
-
[44]
Eca-net: Efficient channel attention for deep convolutional neural networks,
Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2020, pp. 11531–11539
work page 2020
-
[45]
Xcit: Cross-covariance image transformers,
A. Ali, H. Touvron, M. Caron, P. Bojanowski, M. Douze, A. Joulin, I. Laptev, N. Neverova, G. Synnaeve, J. Verbeek, and H. Jégou, “Xcit: Cross-covariance image transformers,” inAd- vances in Neural Information Processing Systems 34: Annual Conference on Neural Infor- mation Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A....
work page 2021
-
[46]
Understanding matrix function normalizations incovariancepoolingthroughthelensofriemanniangeometry,
Z. Chen, Y. Song, X. Wu, G. Liu, and N. Sebe, “Understanding matrix function normalizations incovariancepoolingthroughthelensofriemanniangeometry,” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenRe- view.net, 2025
work page 2025
-
[47]
Why approximate matrix square root outperforms accurate SVD in global covariance pooling?
Y. Song, N. Sebe, and W. Wang, “Why approximate matrix square root outperforms accurate SVD in global covariance pooling?” in2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 2021, pp. 1095–1103
work page 2021
-
[48]
Q. Wang, Z. Zhang, M. Gao, J. Xie, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Towards a deeper understanding of global covariance pooling in deep learning: An optimization perspective,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 12, pp. 15802–15819, 2023. 43
work page 2023
-
[49]
Rethinking attention with performers,
K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller, “Rethinking attention with performers,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021
work page 2021
-
[50]
H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, “Random feature attention,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021
work page 2021
-
[51]
Efficient attention via control variates,
L. Zheng, J. Yuan, C. Wang, and L. Kong, “Efficient attention via control variates,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023
work page 2023
-
[52]
Infinite attention: NNGP and NTK for deep attention networks,
J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak, “Infinite attention: NNGP and NTK for deep attention networks,” inProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 2020, pp. 4376–4386
work page 2020
-
[53]
Transformers are minimax optimal nonparametric in- context learners,
J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in- context learners,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 44
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.