Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
Pith reviewed 2026-05-18 13:51 UTC · model grok-4.3
The pith
For sequence models with effective span dimension at most K the minimax excess risk of spectral algorithms scales as sigma squared times K.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For sequence models whose effective span dimension is at most K, the minimax excess risk of spectral algorithms scales as sigma squared K. Over-parameterized gradient flow can reduce the effective span dimension, establishing a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms.
What carries the argument
The effective span dimension (ESD), an alignment-sensitive measure of problem complexity that depends jointly on the signal, spectrum, and noise level to bound the minimax risk.
If this is right
- The sigma squared K rate applies to learned kernels in sequence models without extra decay assumptions.
- Gradient flow training lowers the effective span dimension and therefore tightens the generalization bound.
- The same ESD framework yields rates for linear models and ordinary RKHS regression.
- Numerical checks confirm that the predicted scaling holds in simulated sequence settings.
Where Pith is reading between the lines
- If the ESD turns out to be small after training on real data, learned kernels could systematically beat fixed kernels on generalization.
- The reduction of ESD by gradient flow may supply a theoretical account for why over-parameterized models often generalize despite high capacity.
- One could design training procedures that explicitly minimize the ESD instead of the usual loss.
- The measure might extend to other adaptive representations such as deep networks to predict their effective complexity.
Load-bearing premise
The problem must be a sequence model in which the effective span dimension is well-defined and finite without any eigen-decay or source conditions.
What would settle it
Fix a sequence model and kernel, compute its effective span dimension K, train the spectral algorithm, and check whether the observed excess risk is within a constant factor of sigma squared K across several noise levels.
Figures
read the original abstract
We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $\sigma^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $\sigma^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the effective span dimension (ESD), an alignment-sensitive complexity measure depending jointly on signal, spectrum, and noise level, for spectral algorithms with learned kernels. It proves that sequence models with ESD at most K have minimax excess risk scaling as σ²K, shows that over-parameterized gradient flow reduces the ESD, and extends the framework to linear models and RKHS regression with numerical experiments.
Significance. If the results hold, the ESD offers a novel perspective on generalization for learned kernels that avoids eigen-decay and source conditions, crediting the work for its internal consistency (the rate follows directly from the ESD definition plus standard spectral arguments) and for establishing a verifiable link between gradient-flow dynamics and reduced complexity. The extensions and experiments strengthen the generality claim.
minor comments (2)
- [§2] §2 (sequence model setup): explicitly list the minimal conditions under which the ESD remains finite and well-defined for arbitrary kernels, to address potential reader questions about unstated measurability requirements.
- [gradient flow analysis] The gradient-flow reduction claim would be strengthened by a brief remark on how the over-parameterization ratio interacts with the noise level σ² in the ESD update.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript, the accurate summary of the ESD framework, and the recommendation for minor revision. We appreciate the recognition of the internal consistency of the bounds, the novel alignment-sensitive perspective that avoids eigen-decay and source conditions, and the verifiable connection between gradient-flow dynamics and reduced complexity.
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper defines the effective span dimension (ESD) as an alignment-sensitive measure jointly depending on signal, spectrum, and noise level, without eigen-decay or source conditions. It then proves that minimax excess risk scales as sigma^2 times ESD (or at most sigma^2 K when ESD <= K) for sequence models, using standard spectral bias arguments. This is not a self-definitional reduction because ESD is constructed as an independent complexity measure whose finiteness holds for arbitrary kernels, and the risk bound follows from explicit analysis rather than by tautological construction. The gradient-flow reduction of ESD is a separate dynamical property. No load-bearing self-citations, fitted inputs renamed as predictions, or ansatz smuggling appear; the framework extends independently to linear models and RKHS regression. The chain is verifiable against external benchmarks and contains independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Sequence model setting suffices to capture learned-kernel spectral algorithms
invented entities (1)
-
Effective span dimension (ESD)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Definition 3.1 … d† = min{k : (1/k) ∑_{i>k} (θ_πi^*)² ≤ σ²}
-
IndisputableMonolith/Foundation/DimensionForcing.leandimension_forcing_D3 unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 3.3 … inf sup R(bθ,θ*) ≍ K σ² over F_K,λ
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A. Agarwal, D. Shah, D. Shen, and D. Song. On robustness of principal component regression.Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[2]
A Convergence Theory for Deep Learning via Over-Parameterization
Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over- parameterization, June 2019. URLhttp://arxiv.org/abs/1811.03962
work page internal anchor Pith review Pith/arXiv arXiv 2019
- [3]
-
[4]
J. Ba, M. A. Erdogdu, T. Suzuki, Z. Wang, D. Wu, and G. Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation, May 2022
work page 2022
-
[5]
P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, Dec. 2020. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1907378117
-
[6]
P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020
work page 2020
-
[7]
D. Barzilai and O. Shamir. Generalization in kernel regression under realistic assumptions. arXiv preprint arXiv:2312.15995, 2023
-
[8]
F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007. doi: 10.1016/j.jco.2006.07.001
-
[9]
X. Bing, F. Bunea, S. Strimas-Mackey, and M. Wegkamp. Prediction under latent factor regression: Adaptive pcr, interpolating predictors and beyond.Journal of Machine Learning Research, 22(177):1–50, 2021
work page 2021
-
[10]
B. Bordelon, A. Atanasov, and C. Pehlevan. How feature learning can improve neural scaling laws. InThe Thirteenth International Conference on Learning Representations,
-
[11]
URLhttps://openreview.net/forum?id=dEypApI1MZ
-
[12]
L. D. Brown, T. T. Cai, M. G. Low, and C.-H. Zhang. Asymptotic equivalence theory for nonparametric regression with random design.The Annals of Statistics, 30(3):688–707,
-
[13]
ISSN 0090-5364. doi: 10.1214/aos/1028674838
-
[14]
A. Caponnetto and E. D. Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007
work page 2007
- [15]
-
[16]
N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. Advances in neural information processing systems, 14, 2001
work page 2001
-
[17]
H. W. Engl, M. Hanke, and A. Neubauer.Regularization of Inverse Problems, volume
-
[18]
Springer Science & Business Media, 1996
work page 1996
-
[19]
K. Gatmiry, S. Jegelka, and J. Kelner. Optimization and Adaptive Generalization of Three layer Neural Networks. InInternational Conference on Learning Representations, Oct. 2021. URLhttps://openreview.net/forum?id=dPyRNUlttBv
work page 2021
- [20]
-
[21]
L. L. Gerfo, L. Rosasco, F. Odone, E. D. Vito, and A. Verri. Spectral algorithms for supervised learning.Neural Computation, 20(7):1873–1897, 2008. 26
work page 2008
-
[22]
B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. When do neural networks outperform kernel methods?Advances in Neural Information Processing Systems, 33: 14820–14830, 2020
work page 2020
-
[23]
A. Green and E. Romanov. The high-dimensional asymptotics of principal component regression.arXiv preprint arXiv:2405.11676, 2024
-
[24]
N. T. Huang, D. W. Hogg, and S. Villar. Dimensionality reduction, regularization, and generalization in overparameterized regressions.SIAM Journal on Mathematics of Data Science, 4(1):126–152, 2022
work page 2022
-
[25]
L. Hucker and M. Wahl. A note on the prediction error of principal component regression in high dimensions.Theory of Probability and Mathematical Statistics, 109:37–53, 2023
work page 2023
-
[26]
A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and gen- eralization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URLhttps://proceedings.neurips. cc/paper/2018/file/5a4be...
work page 2018
-
[27]
I. M. Johnstone. Gaussian estimation: Sequence and wavelet models. 2017
work page 2017
-
[28]
S. Karp, E. Winston, Y. Li, and A. Singh. Local signal adaptivity: Provable feature learning in neural networks beyond kernels.Advances in Neural Information Processing Systems, 34:24883–24897, 2021
work page 2021
-
[29]
S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representa- tions revisited. InInternational conference on machine learning, pages 3519–3529. PMLR, 2019
work page 2019
- [30]
- [31]
- [32]
- [33]
-
[34]
Y. H. Liu, A. Baratin, J. Cornford, S. Mihalas, E. T. SheaBrown, and G. Lajoie. How connectivity structure shapes rich and lazy learning in neural circuits. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[35]
P. Mathé and S. V. Pereverzev. Geometry of linear ill-posed problems in variable hilbert scales.Inverse problems, 19(3):789, 2003. 27
work page 2003
-
[36]
A. Radhakrishnan, D. Beaglehole, P. Pandit, and M. Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models.Science, 383(6690):1461–1467, 2024
work page 2024
-
[37]
L. Rosasco, E. De Vito, and A. Verri. Spectral methods for regularization in learning theory.DISI, Universita degli Studi di Genova, Italy, Technical Report DISI-TR-05-18, 2005
work page 2005
-
[38]
B. Schölkopf and A. J. Smola.Learning with Kernels: Support Vector Machines, Regular- ization, Optimization, and Beyond. MIT Press, 2002
work page 2002
-
[39]
M. Seleznova and G. Kutyniok. Analyzing finite neural networks: Can we trust neural tangent kernel theory? InMathematical and Scientific Machine Learning, pages 868–895. PMLR, 2022. URLhttps://proceedings.mlr.press/v145/seleznova22a.html
work page 2022
-
[40]
Z. Shi, J. Wei, and Y. Liang. Provable guarantees for neural networks via gradient feature learning.Advances in Neural Information Processing Systems, 36:55848–55918, 2023
work page 2023
-
[41]
I. Steinwart and A. Christmann.Support vector machines. Springer Science & Business Media, 2008
work page 2008
-
[42]
I. Steinwart and C. Scovel. Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs. 2012. doi: 10.1007/S00365-012-9153-3
-
[43]
A. Tsigler and P. L. Bartlett. Benign overfitting in ridge regression.Journal of Machine Learning Research, 24(123):1–76, 2023
work page 2023
-
[44]
G. Wahba.Spline Models for Observational Data, volume 59 ofCBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990
work page 1990
-
[45]
C. Wang, X. He, Y. Wang, and J. Wang. On the target-kernel alignment: a unified analysis with kernel complexity.Advances in Neural Information Processing Systems, 37: 40434–40485, 2024
work page 2024
- [46]
-
[47]
B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. InProceedings of Thirty Third Conference on Learning Theory, pages 3635–3673. PMLR, July 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html
work page 2020
- [48]
- [49]
- [50]
-
[51]
Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26:289–315, Aug. 2007. doi: 10.1007/s00365-006-0663-2
-
[52]
B. Yu. Assouad, fano, and le cam. InFestschrift for Lucien Le Cam: research papers in probability and statistics, pages 423–435. Springer, 1997
work page 1997
- [53]
- [54]
-
[55]
T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005. doi: 10.1162/0899766054323008. 29 A Proof A.1 Proofs of Results on ESD of Sequence Models Proof of Theorem 3.2.For anyν >0, define kΛ(ν) = #{j:λ j ≥ν}, which counts how many eigenvalues exceed the thresholdν. The KPCR estimator s...
-
[56]
Let Bn = {π1, . . . , πKn}. We define the collection of hypercubes verticesV = {−1, 1}K. For every vertexv∈ V , define a parameter vector θ(v) = (θ(v) j )d j=1 as in Equation (32). There are2K such vectors {θ(v)}. For eachv∈ V , let Pv be the sampling distribution of the sequence model in Equation (4) withθ∗ = θ(v), σ2 = σ2 0/n, and {ξj}j∈[d] being i.i.d....
-
[57]
We can then apply the argument in Theorem 8.3 withσ2 replaced by¯σ2/n
Therefore, σ2 ≤¯σ2/n. We can then apply the argument in Theorem 8.3 withσ2 replaced by¯σ2/n. Lower bound: We establish the lower bound using Assouad’s method. Let m = ⌊c1K⌋. Consider the firstm eigenfunctions {ψπj }j≤m corresponding to the largest eigenvalues {λπj }j≤m. Define the collection of hypercubes verticesV = {−1, 1}m. For every vertexv∈ V, define...
-
[58]
It then follows that∥f(v)∥2 ∞ ≤σ 2 0C2 0
, σ2 0C2 0 C1κ2 . It then follows that∥f(v)∥2 ∞ ≤σ 2 0C2 0. For eachv∈ V , letPv be the sampling distribution of{zi = (xi, yi)}i≤n from the regression model Equation (20) withf ∗ = f (v). Let ρ be the Hamming distance onV. If v and w∈ V differ in exactly one coordinate (i.e.,ρ(v, w) = 1), then •∥f (v) −f (w)∥2 L2(µ) ≥(2γ) 2, and 35 • the Kullback-Leibler ...
-
[59]
We choose γ2 =n −1 min σ2 0, ¯σ2 4(1 +C 2
Furthermore, ifmγ2 ≤ (2n)−1σ2 0Kn, we can the same argument in Theorem A.2 (in particular, using Theorem 4.1 to dervie Equation (33)) to show thatf (v) ∈ FK,k. We choose γ2 =n −1 min σ2 0, ¯σ2 4(1 +C 2
-
[60]
, σ2 0C2 0 C1κ2 , which impliesf (v) ∈ FK,k. 36 We then follows the same argument in the proof of lower bound in Theorem 8.5 to obtain sup v∈V Ev bθ−θ (v) 2 ≥m (2γ)2 4 =c σ2 0K n , wherecis a constant that depends onC 0,κ,c 1,C 1. C Proofs for Result on overparameterized gradient flow In this section, we prove Theorem 5.2. The high-level idea is as follow...
-
[61]
We have y≥θ(t)≥0∀t≥0
-
[62]
4.|θ ∗ −θ(t)|is decreasing provided that|θ ∗ −θ(t)|> κ
Sincey=θ ∗ +ξand|ξ| ≤κ, we have |θ∗ −θ(t)| ≤ |θ ∗|+κ,∀t≥0. 4.|θ ∗ −θ(t)|is decreasing provided that|θ ∗ −θ(t)|> κ
-
[63]
Ify <0, Items 3, 4, and 5 still hold, while Items 1 and 2 can be modified by symmetry
If|θ ∗ −θ(t 1)| ≤κfor somet 1, we have |θ∗ −θ(t)| ≤κfor allt≥t 1. Ify <0, Items 3, 4, and 5 still hold, while Items 1 and 2 can be modified by symmetry. Proof.Items 1 and 2 are directly implied from Equation (38). Item 3 is implied by Item 2. To prove Item 4, consider Equation (52), from which we have ˙θ=θ 2 a−2 +D 2b−2 +β −2 (y−θ), which implies ˙θ≥0. Si...
-
[64]
Furthermore, Equation (65) implies that T1 ≥inf t≥0 : √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)ds≥r ′
Consequently, we haveb2(t)≤(1 + 1 D)b2 0, and thus |β(t)| ≤ √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)dt,(65) which implies Equation (64) by using the fact that|θ| = |abDβ|. Furthermore, Equation (65) implies that T1 ≥inf t≥0 : √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)ds≥r ′ . Then when t > T 1, in the following, supposea0 ≤b 0/D. We haver′ = a0 and R′ = b0/D. Consider t∈ (T1, T2). W...
-
[65]
By Grönwall inequality, we have |β(t)| ≤a 0 exp √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds , t∈(T 1, T2)
Consequently, Equation (38) implies that |β(t)| ≤a 0 + √ 2ebD 0 Z t T1 |β(s)|(|θ ∗|+|ξ|)ds,(66) for anyt∈(T 1, T2). By Grönwall inequality, we have |β(t)| ≤a 0 exp √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds , t∈(T 1, T2). 48 By definition ofT2, we have T2 ≥inf t≥T 1 : √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds= ln b0/D a0 .(67) The bound forθ(t)now follows from using the boundsa(t) ...
-
[66]
If ˙x(t)≥k x(t), x(0) =x 0, then for allt≥0, it holds that x(t)≥x 0 ekt, and for everyM≥x 0, we have inf{t≥0 :x(t)≥M} ≤ 1 k log M x0
-
[67]
If ˙x(t)≤ −k x(t), x(0) =x 0, then for allt≥0, it holds that x(t)≤x 0 e−kt, and for every0< M≤x 0, we have inf{t≥0 :x(t)≤M} ≤ 1 k log x0 M . D Related Work on Principal Component Regression As discussed in Section 3.1, the PC estimator serves as a motivating example for the concepts of ESD and span profile due to its clear illustration of bias-variance tr...
-
[68]
extend the analysis to a general covariance matrixΣx and an anisotropic prior satisfying 51 Eβ∗β⊤ ∗ =Σ β. They also derive an exact risk expression and demonstrate how “misalignment” betweenΣ x andΣ β affects risk; here “alignment” refers to concordance between the orderings of their eigenvalues. Both studies assume knowledge of the eigenvectors of the po...
-
[69]
Hucker and Wahl [22] derive non-asymptotic error bounds for PCR in kernel regression
Huang et al.[21] derive non-asymptotic risk bounds for PCR in more general settings by analyzing the alignment between population and empirical principal components. Hucker and Wahl [22] derive non-asymptotic error bounds for PCR in kernel regression. Large Language Models Statement Large language models were used to improve the clarity of the manuscript ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.