Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

Dongming Huang; Qian Lin; Yicheng Li; Zhifan Li

arxiv: 2509.20294 · v4 · submitted 2025-09-24 · 💻 cs.LG · math.ST· stat.TH

Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

Dongming Huang , Zhifan Li , Yicheng Li , Qian Lin This is my paper

Pith reviewed 2026-05-18 13:51 UTC · model grok-4.3

classification 💻 cs.LG math.STstat.TH

keywords effective span dimensionspectral algorithmslearned kernelsminimax ratesgradient flowgeneralizationsequence models

0 comments

The pith

For sequence models with effective span dimension at most K the minimax excess risk of spectral algorithms scales as sigma squared times K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines an effective span dimension that captures alignment between the target signal, the kernel spectrum, and the noise level without needing eigenvalue decay or source conditions. It proves that when this dimension is bounded by K in a sequence model, the lowest achievable excess risk for any spectral algorithm is proportional to sigma squared K. The work further shows that running over-parameterized gradient flow on the kernel can shrink the effective span dimension. A reader would care because the result gives a direct generalization guarantee for kernels that are adapted from data rather than fixed in advance.

Core claim

For sequence models whose effective span dimension is at most K, the minimax excess risk of spectral algorithms scales as sigma squared K. Over-parameterized gradient flow can reduce the effective span dimension, establishing a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms.

What carries the argument

The effective span dimension (ESD), an alignment-sensitive measure of problem complexity that depends jointly on the signal, spectrum, and noise level to bound the minimax risk.

If this is right

The sigma squared K rate applies to learned kernels in sequence models without extra decay assumptions.
Gradient flow training lowers the effective span dimension and therefore tightens the generalization bound.
The same ESD framework yields rates for linear models and ordinary RKHS regression.
Numerical checks confirm that the predicted scaling holds in simulated sequence settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the ESD turns out to be small after training on real data, learned kernels could systematically beat fixed kernels on generalization.
The reduction of ESD by gradient flow may supply a theoretical account for why over-parameterized models often generalize despite high capacity.
One could design training procedures that explicitly minimize the ESD instead of the usual loss.
The measure might extend to other adaptive representations such as deep networks to predict their effective complexity.

Load-bearing premise

The problem must be a sequence model in which the effective span dimension is well-defined and finite without any eigen-decay or source conditions.

What would settle it

Fix a sequence model and kernel, compute its effective span dimension K, train the spectral algorithm, and check whether the observed excess risk is within a constant factor of sigma squared K across several noise levels.

Figures

Figures reproduced from arXiv: 2509.20294 by Dongming Huang, Qian Lin, Yicheng Li, Zhifan Li.

**Figure 2.** Figure 2: Averaged squared error of the tuned PC estimator and ESD as a function of the training [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗

**Figure 3.** Figure 3: Oracle PCR risk versus Effective Span Dimension for (a) geometric eigen-decay and (b) logarithmic eigen-decay. The dashed line plots Risk × n/σ2 0 ; the solid line is d † (α). The risk is computed based on 20 replications and the error bar represents the standard deviation. 8.1 RKHS regression We recall the standard random-design kernel regression model from Section 2: yi = f ∗ (xi) + ϵi , ϵi are i.i.d. wi… view at source ↗

**Figure 4.** Figure 4: Effective Span Dimension and Optimal KPCPE risk. The dashed line plots Risk [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

**Figure 5.** Figure 5: Pathwise ESD and risk under a learned kernel using a 4-layer linear network. [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗

read the original abstract

We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $\sigma^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $\sigma^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces the effective span dimension to obtain alignment-sensitive minimax rates for spectral algorithms with learned kernels without eigen-decay or source conditions.

read the letter

The main point is that this paper defines an effective span dimension that captures alignment effects in learned kernels and delivers minimax rates of sigma squared times K for sequence models without needing eigen-decay or source conditions. Gradient flow is shown to reduce this dimension. What the paper does well is to make the complexity measure depend jointly on signal, spectrum, and noise so that it stays finite for arbitrary kernels. The proof that the excess risk scales with this dimension follows from standard spectral arguments once the definition is in place. They also establish that over-parameterized gradient flow shrinks the ESD, which ties adaptive learning to better rates. The extensions to linear models and RKHS plus the experiments give the framework some reach. The soft spots are limited. The central claims rest on the sequence model, which is a reasonable starting point but leaves open how the results translate to more general regression settings. Since the full derivations are not in the abstract, one would want to check the constants and any hidden assumptions in the proofs. Nothing in the structure looks circular or self-referential. This paper is for statistical learning theorists who want complexity measures that handle learned features rather than fixed kernels. Someone working on spectral methods or generalization in overparameterized models would get value from the ESD and the gradient flow link. I would recommend sending it to peer review. The work is coherent on its own terms and addresses a clear limitation in existing theory.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the effective span dimension (ESD), an alignment-sensitive complexity measure depending jointly on signal, spectrum, and noise level, for spectral algorithms with learned kernels. It proves that sequence models with ESD at most K have minimax excess risk scaling as σ²K, shows that over-parameterized gradient flow reduces the ESD, and extends the framework to linear models and RKHS regression with numerical experiments.

Significance. If the results hold, the ESD offers a novel perspective on generalization for learned kernels that avoids eigen-decay and source conditions, crediting the work for its internal consistency (the rate follows directly from the ESD definition plus standard spectral arguments) and for establishing a verifiable link between gradient-flow dynamics and reduced complexity. The extensions and experiments strengthen the generality claim.

minor comments (2)

[§2] §2 (sequence model setup): explicitly list the minimal conditions under which the ESD remains finite and well-defined for arbitrary kernels, to address potential reader questions about unstated measurability requirements.
[gradient flow analysis] The gradient-flow reduction claim would be strengthened by a brief remark on how the over-parameterization ratio interacts with the noise level σ² in the ESD update.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of the ESD framework, and the recommendation for minor revision. We appreciate the recognition of the internal consistency of the bounds, the novel alignment-sensitive perspective that avoids eigen-decay and source conditions, and the verifiable connection between gradient-flow dynamics and reduced complexity.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines the effective span dimension (ESD) as an alignment-sensitive measure jointly depending on signal, spectrum, and noise level, without eigen-decay or source conditions. It then proves that minimax excess risk scales as sigma^2 times ESD (or at most sigma^2 K when ESD <= K) for sequence models, using standard spectral bias arguments. This is not a self-definitional reduction because ESD is constructed as an independent complexity measure whose finiteness holds for arbitrary kernels, and the risk bound follows from explicit analysis rather than by tautological construction. The gradient-flow reduction of ESD is a separate dynamical property. No load-bearing self-citations, fitted inputs renamed as predictions, or ansatz smuggling appear; the framework extends independently to linear models and RKHS regression. The chain is verifiable against external benchmarks and contains independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the sequence model being a valid abstraction for the learned-kernel setting and on the ESD being finite and well-defined for arbitrary kernels without additional decay assumptions.

axioms (1)

domain assumption Sequence model setting suffices to capture learned-kernel spectral algorithms
Abstract states results for sequence models whose ESD is at most K

invented entities (1)

Effective span dimension (ESD) no independent evidence
purpose: Alignment-sensitive complexity measure depending on signal, spectrum, and noise
New quantity introduced to replace traditional effective dimension

pith-pipeline@v0.9.0 · 5682 in / 1255 out tokens · 27174 ms · 2026-05-18T13:51:17.152549+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Definition 3.1 … d† = min{k : (1/k) ∑_{i>k} (θ_πi^*)² ≤ σ²}
IndisputableMonolith/Foundation/DimensionForcing.lean dimension_forcing_D3 unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 3.3 … inf sup R(bθ,θ*) ≍ K σ² over F_K,λ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

[1]

Agarwal, D

A. Agarwal, D. Shah, D. Shen, and D. Song. On robustness of principal component regression.Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[2]

A Convergence Theory for Deep Learning via Over-Parameterization

Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over- parameterization, June 2019. URLhttp://arxiv.org/abs/1811.03962

work page internal anchor Pith review Pith/arXiv arXiv 2019
[3]

Arora, S

S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019. 25

work page 2019
[4]

J. Ba, M. A. Erdogdu, T. Suzuki, Z. Wang, D. Wu, and G. Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation, May 2022

work page 2022
[5]

P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, Dec. 2020. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1907378117

work page doi:10.1073/pnas.1907378117 2020
[6]

P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

work page 2020
[7]

Barzilai and O

D. Barzilai and O. Shamir. Generalization in kernel regression under realistic assumptions. arXiv preprint arXiv:2312.15995, 2023

work page arXiv 2023
[8]

Bauer, S

F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007. doi: 10.1016/j.jco.2006.07.001

work page doi:10.1016/j.jco.2006.07.001 2007
[9]

X. Bing, F. Bunea, S. Strimas-Mackey, and M. Wegkamp. Prediction under latent factor regression: Adaptive pcr, interpolating predictors and beyond.Journal of Machine Learning Research, 22(177):1–50, 2021

work page 2021
[10]

Bordelon, A

B. Bordelon, A. Atanasov, and C. Pehlevan. How feature learning can improve neural scaling laws. InThe Thirteenth International Conference on Learning Representations,

work page
[11]

URLhttps://openreview.net/forum?id=dEypApI1MZ

work page
[12]

L. D. Brown, T. T. Cai, M. G. Low, and C.-H. Zhang. Asymptotic equivalence theory for nonparametric regression with random design.The Annals of Statistics, 30(3):688–707,

work page
[13]

doi: 10.1214/aos/1028674838

ISSN 0090-5364. doi: 10.1214/aos/1028674838

work page doi:10.1214/aos/1028674838
[14]

Caponnetto and E

A. Caponnetto and E. D. Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007

work page 2007
[15]

Cortes, M

C. Cortes, M. Mohri, and A. Rostamizadeh. Algorithms for learning kernels based on centered alignment.The Journal of Machine Learning Research, 13(1):795–828, 2012

work page 2012
[16]

Cristianini, J

N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. Advances in neural information processing systems, 14, 2001

work page 2001
[17]

H. W. Engl, M. Hanke, and A. Neubauer.Regularization of Inverse Problems, volume

work page
[18]

Springer Science & Business Media, 1996

work page 1996
[19]

Gatmiry, S

K. Gatmiry, S. Jegelka, and J. Kelner. Optimization and Adaptive Generalization of Three layer Neural Networks. InInternational Conference on Learning Representations, Oct. 2021. URLhttps://openreview.net/forum?id=dPyRNUlttBv

work page 2021
[20]

Gedon, A

D. Gedon, A. H. Ribeiro, and T. B. Schön. No double descent in principal component regression: A high-dimensional analysis. InForty-first International Conference on Machine Learning, 2024

work page 2024
[21]

L. L. Gerfo, L. Rosasco, F. Odone, E. D. Vito, and A. Verri. Spectral algorithms for supervised learning.Neural Computation, 20(7):1873–1897, 2008. 26

work page 2008
[22]

Ghorbani, S

B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. When do neural networks outperform kernel methods?Advances in Neural Information Processing Systems, 33: 14820–14830, 2020

work page 2020
[23]

Green and E

A. Green and E. Romanov. The high-dimensional asymptotics of principal component regression.arXiv preprint arXiv:2405.11676, 2024

work page arXiv 2024
[24]

N. T. Huang, D. W. Hogg, and S. Villar. Dimensionality reduction, regularization, and generalization in overparameterized regressions.SIAM Journal on Mathematics of Data Science, 4(1):126–152, 2022

work page 2022
[25]

Hucker and M

L. Hucker and M. Wahl. A note on the prediction error of principal component regression in high dimensions.Theory of Probability and Mathematical Statistics, 109:37–53, 2023

work page 2023
[26]

Jacot, F

A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and gen- eralization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URLhttps://proceedings.neurips. cc/paper/2018/file/5a4be...

work page 2018
[27]

I. M. Johnstone. Gaussian estimation: Sequence and wavelet models. 2017

work page 2017
[28]

S. Karp, E. Winston, Y. Li, and A. Singh. Local signal adaptivity: Provable feature learning in neural networks beyond kernels.Advances in Neural Information Processing Systems, 34:24883–24897, 2021

work page 2021
[29]

Kornblith, M

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representa- tions revisited. InInternational conference on machine learning, pages 3519–3529. PMLR, 2019

work page 2019
[30]

Kunin, A

D. Kunin, A. Raventós, C. Dominé, F. Chen, D. Klindt, A. Saxe, and S. Ganguli. Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning.Advances in Neural Information Processing Systems, 37:81157–81203, 2024

work page 2024
[31]

Li and Q

Y. Li and Q. Lin. Improving adaptivity via over-parameterization in sequence models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=UfLH4T676K

work page 2024
[32]

Li and Q

Y. Li and Q. Lin. Diagonal over-parameterization in reproducing kernel hilbert spaces as an adaptive feature model: Generalization and adaptivity.arXiv preprint arXiv:2501.08679, 2025

work page arXiv 2025
[33]

Y. Li, W. Gan, Z. Shi, and Q. Lin. Generalization error curves for analytic spectral algorithms under power-law decay.arXiv preprint arXiv:2401.01599, 2024

work page arXiv 2024
[34]

Y. H. Liu, A. Baratin, J. Cornford, S. Mihalas, E. T. SheaBrown, and G. Lajoie. How connectivity structure shapes rich and lazy learning in neural circuits. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[35]

Mathé and S

P. Mathé and S. V. Pereverzev. Geometry of linear ill-posed problems in variable hilbert scales.Inverse problems, 19(3):789, 2003. 27

work page 2003
[36]

Radhakrishnan, D

A. Radhakrishnan, D. Beaglehole, P. Pandit, and M. Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models.Science, 383(6690):1461–1467, 2024

work page 2024
[37]

Rosasco, E

L. Rosasco, E. De Vito, and A. Verri. Spectral methods for regularization in learning theory.DISI, Universita degli Studi di Genova, Italy, Technical Report DISI-TR-05-18, 2005

work page 2005
[38]

Schölkopf and A

B. Schölkopf and A. J. Smola.Learning with Kernels: Support Vector Machines, Regular- ization, Optimization, and Beyond. MIT Press, 2002

work page 2002
[39]

Seleznova and G

M. Seleznova and G. Kutyniok. Analyzing finite neural networks: Can we trust neural tangent kernel theory? InMathematical and Scientific Machine Learning, pages 868–895. PMLR, 2022. URLhttps://proceedings.mlr.press/v145/seleznova22a.html

work page 2022
[40]

Z. Shi, J. Wei, and Y. Liang. Provable guarantees for neural networks via gradient feature learning.Advances in Neural Information Processing Systems, 36:55848–55918, 2023

work page 2023
[41]

Steinwart and A

I. Steinwart and A. Christmann.Support vector machines. Springer Science & Business Media, 2008

work page 2008
[42]

Steinwart and C

I. Steinwart and C. Scovel. Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs. 2012. doi: 10.1007/S00365-012-9153-3

work page doi:10.1007/s00365-012-9153-3 2012
[43]

Tsigler and P

A. Tsigler and P. L. Bartlett. Benign overfitting in ridge regression.Journal of Machine Learning Research, 24(123):1–76, 2023

work page 2023
[44]

Wahba.Spline Models for Observational Data, volume 59 ofCBMS-NSF Regional Conference Series in Applied Mathematics

G. Wahba.Spline Models for Observational Data, volume 59 ofCBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990

work page 1990
[45]

C. Wang, X. He, Y. Wang, and J. Wang. On the target-kernel alignment: a unified analysis with kernel complexity.Advances in Neural Information Processing Systems, 37: 40434–40485, 2024

work page 2024
[46]

Wenger, F

J. Wenger, F. Dangel, and A. Kristiadi. On the disconnect between theory and practice of overparametrized neural networks, Sept. 2023. URLhttp://arxiv.org/abs/2310.00137

work page arXiv 2023
[47]

Woodworth, S

B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. InProceedings of Thirty Third Conference on Learning Theory, pages 3635–3673. PMLR, July 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html

work page 2020
[48]

Wu and J

D. Wu and J. Xu. On the optimal weightedℓ2 regularization in overparameterized linear regression.Advances in Neural Information Processing Systems, 33:10112–10123, 2020

work page 2020
[49]

Xu and D

J. Xu and D. J. Hsu. On the number of variables to use in principal component regression. Advances in neural information processing systems, 32, 2019

work page 2019
[50]

Xu and L

Y. Xu and L. Ziyin. Three mechanisms of feature learning in a linear network. InThe Thirteenth International Conference on Learning Representations, 2025. 28

work page 2025
[51]

Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26:289–315, Aug. 2007. doi: 10.1007/s00365-006-0663-2

work page doi:10.1007/s00365-006-0663-2 2007
[52]

B. Yu. Assouad, fano, and le cam. InFestschrift for Lucien Le Cam: research papers in probability and statistics, pages 423–435. Springer, 1997

work page 1997
[53]

Zhang, Y

H. Zhang, Y. Li, and Q. Lin. On the optimality of misspecified spectral algorithms, Mar. 2023

work page 2023
[54]

Zhang, J

H. Zhang, J. Lai, Y. Li, Q. Lin, and J. S. Liu. Towards a statistical understanding of neural networks: Beyond the neural tangent kernel theories.arXiv preprint arXiv:2412.18756, 2024

work page arXiv 2024
[55]

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005. doi: 10.1162/0899766054323008. 29 A Proof A.1 Proofs of Results on ESD of Sequence Models Proof of Theorem 3.2.For anyν >0, define kΛ(ν) = #{j:λ j ≥ν}, which counts how many eigenvalues exceed the thresholdν. The KPCR estimator s...

work page doi:10.1162/0899766054323008 2077
[56]

Let Bn = {π1, . . . , πKn}. We define the collection of hypercubes verticesV = {−1, 1}K. For every vertexv∈ V , define a parameter vector θ(v) = (θ(v) j )d j=1 as in Equation (32). There are2K such vectors {θ(v)}. For eachv∈ V , let Pv be the sampling distribution of the sequence model in Equation (4) withθ∗ = θ(v), σ2 = σ2 0/n, and {ξj}j∈[d] being i.i.d....

work page
[57]

We can then apply the argument in Theorem 8.3 withσ2 replaced by¯σ2/n

Therefore, σ2 ≤¯σ2/n. We can then apply the argument in Theorem 8.3 withσ2 replaced by¯σ2/n. Lower bound: We establish the lower bound using Assouad’s method. Let m = ⌊c1K⌋. Consider the firstm eigenfunctions {ψπj }j≤m corresponding to the largest eigenvalues {λπj }j≤m. Define the collection of hypercubes verticesV = {−1, 1}m. For every vertexv∈ V, define...

work page
[58]

It then follows that∥f(v)∥2 ∞ ≤σ 2 0C2 0

, σ2 0C2 0 C1κ2 . It then follows that∥f(v)∥2 ∞ ≤σ 2 0C2 0. For eachv∈ V , letPv be the sampling distribution of{zi = (xi, yi)}i≤n from the regression model Equation (20) withf ∗ = f (v). Let ρ be the Hamming distance onV. If v and w∈ V differ in exactly one coordinate (i.e.,ρ(v, w) = 1), then •∥f (v) −f (w)∥2 L2(µ) ≥(2γ) 2, and 35 • the Kullback-Leibler ...

work page
[59]

We choose γ2 =n −1 min σ2 0, ¯σ2 4(1 +C 2

Furthermore, ifmγ2 ≤ (2n)−1σ2 0Kn, we can the same argument in Theorem A.2 (in particular, using Theorem 4.1 to dervie Equation (33)) to show thatf (v) ∈ FK,k. We choose γ2 =n −1 min σ2 0, ¯σ2 4(1 +C 2

work page
[60]

, σ2 0C2 0 C1κ2 , which impliesf (v) ∈ FK,k. 36 We then follows the same argument in the proof of lower bound in Theorem 8.5 to obtain sup v∈V Ev bθ−θ (v) 2 ≥m (2γ)2 4 =c σ2 0K n , wherecis a constant that depends onC 0,κ,c 1,C 1. C Proofs for Result on overparameterized gradient flow In this section, we prove Theorem 5.2. The high-level idea is as follow...

work page
[61]

We have y≥θ(t)≥0∀t≥0

work page
[62]

4.|θ ∗ −θ(t)|is decreasing provided that|θ ∗ −θ(t)|> κ

Sincey=θ ∗ +ξand|ξ| ≤κ, we have |θ∗ −θ(t)| ≤ |θ ∗|+κ,∀t≥0. 4.|θ ∗ −θ(t)|is decreasing provided that|θ ∗ −θ(t)|> κ

work page
[63]

Ify <0, Items 3, 4, and 5 still hold, while Items 1 and 2 can be modified by symmetry

If|θ ∗ −θ(t 1)| ≤κfor somet 1, we have |θ∗ −θ(t)| ≤κfor allt≥t 1. Ify <0, Items 3, 4, and 5 still hold, while Items 1 and 2 can be modified by symmetry. Proof.Items 1 and 2 are directly implied from Equation (38). Item 3 is implied by Item 2. To prove Item 4, consider Equation (52), from which we have ˙θ=θ 2 a−2 +D 2b−2 +β −2 (y−θ), which implies ˙θ≥0. Si...

work page
[64]

Furthermore, Equation (65) implies that T1 ≥inf t≥0 : √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)ds≥r ′

Consequently, we haveb2(t)≤(1 + 1 D)b2 0, and thus |β(t)| ≤ √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)dt,(65) which implies Equation (64) by using the fact that|θ| = |abDβ|. Furthermore, Equation (65) implies that T1 ≥inf t≥0 : √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)ds≥r ′ . Then when t > T 1, in the following, supposea0 ≤b 0/D. We haver′ = a0 and R′ = b0/D. Consider t∈ (T1, T2). W...

work page
[65]

By Grönwall inequality, we have |β(t)| ≤a 0 exp √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds , t∈(T 1, T2)

Consequently, Equation (38) implies that |β(t)| ≤a 0 + √ 2ebD 0 Z t T1 |β(s)|(|θ ∗|+|ξ|)ds,(66) for anyt∈(T 1, T2). By Grönwall inequality, we have |β(t)| ≤a 0 exp √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds , t∈(T 1, T2). 48 By definition ofT2, we have T2 ≥inf t≥T 1 : √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds= ln b0/D a0 .(67) The bound forθ(t)now follows from using the boundsa(t) ...

work page
[66]

If ˙x(t)≥k x(t), x(0) =x 0, then for allt≥0, it holds that x(t)≥x 0 ekt, and for everyM≥x 0, we have inf{t≥0 :x(t)≥M} ≤ 1 k log M x0

work page
[67]

double-descent

If ˙x(t)≤ −k x(t), x(0) =x 0, then for allt≥0, it holds that x(t)≤x 0 e−kt, and for every0< M≤x 0, we have inf{t≥0 :x(t)≤M} ≤ 1 k log x0 M . D Related Work on Principal Component Regression As discussed in Section 3.1, the PC estimator serves as a motivating example for the concepts of ESD and span profile due to its clear illustration of bias-variance tr...

work page
[68]

misalignment

extend the analysis to a general covariance matrixΣx and an anisotropic prior satisfying 51 Eβ∗β⊤ ∗ =Σ β. They also derive an exact risk expression and demonstrate how “misalignment” betweenΣ x andΣ β affects risk; here “alignment” refers to concordance between the orderings of their eigenvalues. Both studies assume knowledge of the eigenvectors of the po...

work page
[69]

Hucker and Wahl [22] derive non-asymptotic error bounds for PCR in kernel regression

Huang et al.[21] derive non-asymptotic risk bounds for PCR in more general settings by analyzing the alignment between population and empirical principal components. Hucker and Wahl [22] derive non-asymptotic error bounds for PCR in kernel regression. Large Language Models Statement Large language models were used to improve the clarity of the manuscript ...

work page

[1] [1]

Agarwal, D

A. Agarwal, D. Shah, D. Shen, and D. Song. On robustness of principal component regression.Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[2] [2]

A Convergence Theory for Deep Learning via Over-Parameterization

Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over- parameterization, June 2019. URLhttp://arxiv.org/abs/1811.03962

work page internal anchor Pith review Pith/arXiv arXiv 2019

[3] [3]

Arora, S

S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019. 25

work page 2019

[4] [4]

J. Ba, M. A. Erdogdu, T. Suzuki, Z. Wang, D. Wu, and G. Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation, May 2022

work page 2022

[5] [5]

P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, Dec. 2020. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1907378117

work page doi:10.1073/pnas.1907378117 2020

[6] [6]

P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

work page 2020

[7] [7]

Barzilai and O

D. Barzilai and O. Shamir. Generalization in kernel regression under realistic assumptions. arXiv preprint arXiv:2312.15995, 2023

work page arXiv 2023

[8] [8]

Bauer, S

F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007. doi: 10.1016/j.jco.2006.07.001

work page doi:10.1016/j.jco.2006.07.001 2007

[9] [9]

X. Bing, F. Bunea, S. Strimas-Mackey, and M. Wegkamp. Prediction under latent factor regression: Adaptive pcr, interpolating predictors and beyond.Journal of Machine Learning Research, 22(177):1–50, 2021

work page 2021

[10] [10]

Bordelon, A

B. Bordelon, A. Atanasov, and C. Pehlevan. How feature learning can improve neural scaling laws. InThe Thirteenth International Conference on Learning Representations,

work page

[11] [11]

URLhttps://openreview.net/forum?id=dEypApI1MZ

work page

[12] [12]

L. D. Brown, T. T. Cai, M. G. Low, and C.-H. Zhang. Asymptotic equivalence theory for nonparametric regression with random design.The Annals of Statistics, 30(3):688–707,

work page

[13] [13]

doi: 10.1214/aos/1028674838

ISSN 0090-5364. doi: 10.1214/aos/1028674838

work page doi:10.1214/aos/1028674838

[14] [14]

Caponnetto and E

A. Caponnetto and E. D. Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007

work page 2007

[15] [15]

Cortes, M

C. Cortes, M. Mohri, and A. Rostamizadeh. Algorithms for learning kernels based on centered alignment.The Journal of Machine Learning Research, 13(1):795–828, 2012

work page 2012

[16] [16]

Cristianini, J

N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. Advances in neural information processing systems, 14, 2001

work page 2001

[17] [17]

H. W. Engl, M. Hanke, and A. Neubauer.Regularization of Inverse Problems, volume

work page

[18] [18]

Springer Science & Business Media, 1996

work page 1996

[19] [19]

Gatmiry, S

K. Gatmiry, S. Jegelka, and J. Kelner. Optimization and Adaptive Generalization of Three layer Neural Networks. InInternational Conference on Learning Representations, Oct. 2021. URLhttps://openreview.net/forum?id=dPyRNUlttBv

work page 2021

[20] [20]

Gedon, A

D. Gedon, A. H. Ribeiro, and T. B. Schön. No double descent in principal component regression: A high-dimensional analysis. InForty-first International Conference on Machine Learning, 2024

work page 2024

[21] [21]

L. L. Gerfo, L. Rosasco, F. Odone, E. D. Vito, and A. Verri. Spectral algorithms for supervised learning.Neural Computation, 20(7):1873–1897, 2008. 26

work page 2008

[22] [22]

Ghorbani, S

B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. When do neural networks outperform kernel methods?Advances in Neural Information Processing Systems, 33: 14820–14830, 2020

work page 2020

[23] [23]

Green and E

A. Green and E. Romanov. The high-dimensional asymptotics of principal component regression.arXiv preprint arXiv:2405.11676, 2024

work page arXiv 2024

[24] [24]

N. T. Huang, D. W. Hogg, and S. Villar. Dimensionality reduction, regularization, and generalization in overparameterized regressions.SIAM Journal on Mathematics of Data Science, 4(1):126–152, 2022

work page 2022

[25] [25]

Hucker and M

L. Hucker and M. Wahl. A note on the prediction error of principal component regression in high dimensions.Theory of Probability and Mathematical Statistics, 109:37–53, 2023

work page 2023

[26] [26]

Jacot, F

A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and gen- eralization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URLhttps://proceedings.neurips. cc/paper/2018/file/5a4be...

work page 2018

[27] [27]

I. M. Johnstone. Gaussian estimation: Sequence and wavelet models. 2017

work page 2017

[28] [28]

S. Karp, E. Winston, Y. Li, and A. Singh. Local signal adaptivity: Provable feature learning in neural networks beyond kernels.Advances in Neural Information Processing Systems, 34:24883–24897, 2021

work page 2021

[29] [29]

Kornblith, M

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representa- tions revisited. InInternational conference on machine learning, pages 3519–3529. PMLR, 2019

work page 2019

[30] [30]

Kunin, A

D. Kunin, A. Raventós, C. Dominé, F. Chen, D. Klindt, A. Saxe, and S. Ganguli. Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning.Advances in Neural Information Processing Systems, 37:81157–81203, 2024

work page 2024

[31] [31]

Li and Q

Y. Li and Q. Lin. Improving adaptivity via over-parameterization in sequence models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=UfLH4T676K

work page 2024

[32] [32]

Li and Q

Y. Li and Q. Lin. Diagonal over-parameterization in reproducing kernel hilbert spaces as an adaptive feature model: Generalization and adaptivity.arXiv preprint arXiv:2501.08679, 2025

work page arXiv 2025

[33] [33]

Y. Li, W. Gan, Z. Shi, and Q. Lin. Generalization error curves for analytic spectral algorithms under power-law decay.arXiv preprint arXiv:2401.01599, 2024

work page arXiv 2024

[34] [34]

Y. H. Liu, A. Baratin, J. Cornford, S. Mihalas, E. T. SheaBrown, and G. Lajoie. How connectivity structure shapes rich and lazy learning in neural circuits. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[35] [35]

Mathé and S

P. Mathé and S. V. Pereverzev. Geometry of linear ill-posed problems in variable hilbert scales.Inverse problems, 19(3):789, 2003. 27

work page 2003

[36] [36]

Radhakrishnan, D

A. Radhakrishnan, D. Beaglehole, P. Pandit, and M. Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models.Science, 383(6690):1461–1467, 2024

work page 2024

[37] [37]

Rosasco, E

L. Rosasco, E. De Vito, and A. Verri. Spectral methods for regularization in learning theory.DISI, Universita degli Studi di Genova, Italy, Technical Report DISI-TR-05-18, 2005

work page 2005

[38] [38]

Schölkopf and A

B. Schölkopf and A. J. Smola.Learning with Kernels: Support Vector Machines, Regular- ization, Optimization, and Beyond. MIT Press, 2002

work page 2002

[39] [39]

Seleznova and G

M. Seleznova and G. Kutyniok. Analyzing finite neural networks: Can we trust neural tangent kernel theory? InMathematical and Scientific Machine Learning, pages 868–895. PMLR, 2022. URLhttps://proceedings.mlr.press/v145/seleznova22a.html

work page 2022

[40] [40]

Z. Shi, J. Wei, and Y. Liang. Provable guarantees for neural networks via gradient feature learning.Advances in Neural Information Processing Systems, 36:55848–55918, 2023

work page 2023

[41] [41]

Steinwart and A

I. Steinwart and A. Christmann.Support vector machines. Springer Science & Business Media, 2008

work page 2008

[42] [42]

Steinwart and C

I. Steinwart and C. Scovel. Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs. 2012. doi: 10.1007/S00365-012-9153-3

work page doi:10.1007/s00365-012-9153-3 2012

[43] [43]

Tsigler and P

A. Tsigler and P. L. Bartlett. Benign overfitting in ridge regression.Journal of Machine Learning Research, 24(123):1–76, 2023

work page 2023

[44] [44]

Wahba.Spline Models for Observational Data, volume 59 ofCBMS-NSF Regional Conference Series in Applied Mathematics

G. Wahba.Spline Models for Observational Data, volume 59 ofCBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990

work page 1990

[45] [45]

C. Wang, X. He, Y. Wang, and J. Wang. On the target-kernel alignment: a unified analysis with kernel complexity.Advances in Neural Information Processing Systems, 37: 40434–40485, 2024

work page 2024

[46] [46]

Wenger, F

J. Wenger, F. Dangel, and A. Kristiadi. On the disconnect between theory and practice of overparametrized neural networks, Sept. 2023. URLhttp://arxiv.org/abs/2310.00137

work page arXiv 2023

[47] [47]

Woodworth, S

B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. InProceedings of Thirty Third Conference on Learning Theory, pages 3635–3673. PMLR, July 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html

work page 2020

[48] [48]

Wu and J

D. Wu and J. Xu. On the optimal weightedℓ2 regularization in overparameterized linear regression.Advances in Neural Information Processing Systems, 33:10112–10123, 2020

work page 2020

[49] [49]

Xu and D

J. Xu and D. J. Hsu. On the number of variables to use in principal component regression. Advances in neural information processing systems, 32, 2019

work page 2019

[50] [50]

Xu and L

Y. Xu and L. Ziyin. Three mechanisms of feature learning in a linear network. InThe Thirteenth International Conference on Learning Representations, 2025. 28

work page 2025

[51] [51]

Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26:289–315, Aug. 2007. doi: 10.1007/s00365-006-0663-2

work page doi:10.1007/s00365-006-0663-2 2007

[52] [52]

B. Yu. Assouad, fano, and le cam. InFestschrift for Lucien Le Cam: research papers in probability and statistics, pages 423–435. Springer, 1997

work page 1997

[53] [53]

Zhang, Y

H. Zhang, Y. Li, and Q. Lin. On the optimality of misspecified spectral algorithms, Mar. 2023

work page 2023

[54] [54]

Zhang, J

H. Zhang, J. Lai, Y. Li, Q. Lin, and J. S. Liu. Towards a statistical understanding of neural networks: Beyond the neural tangent kernel theories.arXiv preprint arXiv:2412.18756, 2024

work page arXiv 2024

[55] [55]

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005. doi: 10.1162/0899766054323008. 29 A Proof A.1 Proofs of Results on ESD of Sequence Models Proof of Theorem 3.2.For anyν >0, define kΛ(ν) = #{j:λ j ≥ν}, which counts how many eigenvalues exceed the thresholdν. The KPCR estimator s...

work page doi:10.1162/0899766054323008 2077

[56] [56]

Let Bn = {π1, . . . , πKn}. We define the collection of hypercubes verticesV = {−1, 1}K. For every vertexv∈ V , define a parameter vector θ(v) = (θ(v) j )d j=1 as in Equation (32). There are2K such vectors {θ(v)}. For eachv∈ V , let Pv be the sampling distribution of the sequence model in Equation (4) withθ∗ = θ(v), σ2 = σ2 0/n, and {ξj}j∈[d] being i.i.d....

work page

[57] [57]

We can then apply the argument in Theorem 8.3 withσ2 replaced by¯σ2/n

Therefore, σ2 ≤¯σ2/n. We can then apply the argument in Theorem 8.3 withσ2 replaced by¯σ2/n. Lower bound: We establish the lower bound using Assouad’s method. Let m = ⌊c1K⌋. Consider the firstm eigenfunctions {ψπj }j≤m corresponding to the largest eigenvalues {λπj }j≤m. Define the collection of hypercubes verticesV = {−1, 1}m. For every vertexv∈ V, define...

work page

[58] [58]

It then follows that∥f(v)∥2 ∞ ≤σ 2 0C2 0

, σ2 0C2 0 C1κ2 . It then follows that∥f(v)∥2 ∞ ≤σ 2 0C2 0. For eachv∈ V , letPv be the sampling distribution of{zi = (xi, yi)}i≤n from the regression model Equation (20) withf ∗ = f (v). Let ρ be the Hamming distance onV. If v and w∈ V differ in exactly one coordinate (i.e.,ρ(v, w) = 1), then •∥f (v) −f (w)∥2 L2(µ) ≥(2γ) 2, and 35 • the Kullback-Leibler ...

work page

[59] [59]

We choose γ2 =n −1 min σ2 0, ¯σ2 4(1 +C 2

Furthermore, ifmγ2 ≤ (2n)−1σ2 0Kn, we can the same argument in Theorem A.2 (in particular, using Theorem 4.1 to dervie Equation (33)) to show thatf (v) ∈ FK,k. We choose γ2 =n −1 min σ2 0, ¯σ2 4(1 +C 2

work page

[60] [60]

, σ2 0C2 0 C1κ2 , which impliesf (v) ∈ FK,k. 36 We then follows the same argument in the proof of lower bound in Theorem 8.5 to obtain sup v∈V Ev bθ−θ (v) 2 ≥m (2γ)2 4 =c σ2 0K n , wherecis a constant that depends onC 0,κ,c 1,C 1. C Proofs for Result on overparameterized gradient flow In this section, we prove Theorem 5.2. The high-level idea is as follow...

work page

[61] [61]

We have y≥θ(t)≥0∀t≥0

work page

[62] [62]

4.|θ ∗ −θ(t)|is decreasing provided that|θ ∗ −θ(t)|> κ

Sincey=θ ∗ +ξand|ξ| ≤κ, we have |θ∗ −θ(t)| ≤ |θ ∗|+κ,∀t≥0. 4.|θ ∗ −θ(t)|is decreasing provided that|θ ∗ −θ(t)|> κ

work page

[63] [63]

Ify <0, Items 3, 4, and 5 still hold, while Items 1 and 2 can be modified by symmetry

If|θ ∗ −θ(t 1)| ≤κfor somet 1, we have |θ∗ −θ(t)| ≤κfor allt≥t 1. Ify <0, Items 3, 4, and 5 still hold, while Items 1 and 2 can be modified by symmetry. Proof.Items 1 and 2 are directly implied from Equation (38). Item 3 is implied by Item 2. To prove Item 4, consider Equation (52), from which we have ˙θ=θ 2 a−2 +D 2b−2 +β −2 (y−θ), which implies ˙θ≥0. Si...

work page

[64] [64]

Furthermore, Equation (65) implies that T1 ≥inf t≥0 : √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)ds≥r ′

Consequently, we haveb2(t)≤(1 + 1 D)b2 0, and thus |β(t)| ≤ √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)dt,(65) which implies Equation (64) by using the fact that|θ| = |abDβ|. Furthermore, Equation (65) implies that T1 ≥inf t≥0 : √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)ds≥r ′ . Then when t > T 1, in the following, supposea0 ≤b 0/D. We haver′ = a0 and R′ = b0/D. Consider t∈ (T1, T2). W...

work page

[65] [65]

By Grönwall inequality, we have |β(t)| ≤a 0 exp √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds , t∈(T 1, T2)

Consequently, Equation (38) implies that |β(t)| ≤a 0 + √ 2ebD 0 Z t T1 |β(s)|(|θ ∗|+|ξ|)ds,(66) for anyt∈(T 1, T2). By Grönwall inequality, we have |β(t)| ≤a 0 exp √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds , t∈(T 1, T2). 48 By definition ofT2, we have T2 ≥inf t≥T 1 : √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds= ln b0/D a0 .(67) The bound forθ(t)now follows from using the boundsa(t) ...

work page

[66] [66]

If ˙x(t)≥k x(t), x(0) =x 0, then for allt≥0, it holds that x(t)≥x 0 ekt, and for everyM≥x 0, we have inf{t≥0 :x(t)≥M} ≤ 1 k log M x0

work page

[67] [67]

double-descent

If ˙x(t)≤ −k x(t), x(0) =x 0, then for allt≥0, it holds that x(t)≤x 0 e−kt, and for every0< M≤x 0, we have inf{t≥0 :x(t)≤M} ≤ 1 k log x0 M . D Related Work on Principal Component Regression As discussed in Section 3.1, the PC estimator serves as a motivating example for the concepts of ESD and span profile due to its clear illustration of bias-variance tr...

work page

[68] [68]

misalignment

extend the analysis to a general covariance matrixΣx and an anisotropic prior satisfying 51 Eβ∗β⊤ ∗ =Σ β. They also derive an exact risk expression and demonstrate how “misalignment” betweenΣ x andΣ β affects risk; here “alignment” refers to concordance between the orderings of their eigenvalues. Both studies assume knowledge of the eigenvectors of the po...

work page

[69] [69]

Hucker and Wahl [22] derive non-asymptotic error bounds for PCR in kernel regression

Huang et al.[21] derive non-asymptotic risk bounds for PCR in more general settings by analyzing the alignment between population and empirical principal components. Hucker and Wahl [22] derive non-asymptotic error bounds for PCR in kernel regression. Large Language Models Statement Large language models were used to improve the clarity of the manuscript ...

work page