pith. sign in

arxiv: 2509.20294 · v4 · submitted 2025-09-24 · 💻 cs.LG · math.ST· stat.TH

Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

Pith reviewed 2026-05-18 13:51 UTC · model grok-4.3

classification 💻 cs.LG math.STstat.TH
keywords effective span dimensionspectral algorithmslearned kernelsminimax ratesgradient flowgeneralizationsequence models
0
0 comments X

The pith

For sequence models with effective span dimension at most K the minimax excess risk of spectral algorithms scales as sigma squared times K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines an effective span dimension that captures alignment between the target signal, the kernel spectrum, and the noise level without needing eigenvalue decay or source conditions. It proves that when this dimension is bounded by K in a sequence model, the lowest achievable excess risk for any spectral algorithm is proportional to sigma squared K. The work further shows that running over-parameterized gradient flow on the kernel can shrink the effective span dimension. A reader would care because the result gives a direct generalization guarantee for kernels that are adapted from data rather than fixed in advance.

Core claim

For sequence models whose effective span dimension is at most K, the minimax excess risk of spectral algorithms scales as sigma squared K. Over-parameterized gradient flow can reduce the effective span dimension, establishing a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms.

What carries the argument

The effective span dimension (ESD), an alignment-sensitive measure of problem complexity that depends jointly on the signal, spectrum, and noise level to bound the minimax risk.

If this is right

  • The sigma squared K rate applies to learned kernels in sequence models without extra decay assumptions.
  • Gradient flow training lowers the effective span dimension and therefore tightens the generalization bound.
  • The same ESD framework yields rates for linear models and ordinary RKHS regression.
  • Numerical checks confirm that the predicted scaling holds in simulated sequence settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the ESD turns out to be small after training on real data, learned kernels could systematically beat fixed kernels on generalization.
  • The reduction of ESD by gradient flow may supply a theoretical account for why over-parameterized models often generalize despite high capacity.
  • One could design training procedures that explicitly minimize the ESD instead of the usual loss.
  • The measure might extend to other adaptive representations such as deep networks to predict their effective complexity.

Load-bearing premise

The problem must be a sequence model in which the effective span dimension is well-defined and finite without any eigen-decay or source conditions.

What would settle it

Fix a sequence model and kernel, compute its effective span dimension K, train the spectral algorithm, and check whether the observed excess risk is within a constant factor of sigma squared K across several noise levels.

Figures

Figures reproduced from arXiv: 2509.20294 by Dongming Huang, Qian Lin, Yicheng Li, Zhifan Li.

Figure 1
Figure 1. Figure 1: Evolution of span profiles during the training of an over-parameterized gradient flow. The [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Averaged squared error of the tuned PC estimator and ESD as a function of the training [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Oracle PCR risk versus Effective Span Dimension for (a) geometric eigen-decay and (b) logarithmic eigen-decay. The dashed line plots Risk × n/σ2 0 ; the solid line is d † (α). The risk is computed based on 20 replications and the error bar represents the standard deviation. 8.1 RKHS regression We recall the standard random-design kernel regression model from Section 2: yi = f ∗ (xi) + ϵi , ϵi are i.i.d. wi… view at source ↗
Figure 4
Figure 4. Figure 4: Effective Span Dimension and Optimal KPCPE risk. The dashed line plots Risk [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pathwise ESD and risk under a learned kernel using a 4-layer linear network. [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗
read the original abstract

We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $\sigma^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $\sigma^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the effective span dimension (ESD), an alignment-sensitive complexity measure depending jointly on signal, spectrum, and noise level, for spectral algorithms with learned kernels. It proves that sequence models with ESD at most K have minimax excess risk scaling as σ²K, shows that over-parameterized gradient flow reduces the ESD, and extends the framework to linear models and RKHS regression with numerical experiments.

Significance. If the results hold, the ESD offers a novel perspective on generalization for learned kernels that avoids eigen-decay and source conditions, crediting the work for its internal consistency (the rate follows directly from the ESD definition plus standard spectral arguments) and for establishing a verifiable link between gradient-flow dynamics and reduced complexity. The extensions and experiments strengthen the generality claim.

minor comments (2)
  1. [§2] §2 (sequence model setup): explicitly list the minimal conditions under which the ESD remains finite and well-defined for arbitrary kernels, to address potential reader questions about unstated measurability requirements.
  2. [gradient flow analysis] The gradient-flow reduction claim would be strengthened by a brief remark on how the over-parameterization ratio interacts with the noise level σ² in the ESD update.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of the ESD framework, and the recommendation for minor revision. We appreciate the recognition of the internal consistency of the bounds, the novel alignment-sensitive perspective that avoids eigen-decay and source conditions, and the verifiable connection between gradient-flow dynamics and reduced complexity.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines the effective span dimension (ESD) as an alignment-sensitive measure jointly depending on signal, spectrum, and noise level, without eigen-decay or source conditions. It then proves that minimax excess risk scales as sigma^2 times ESD (or at most sigma^2 K when ESD <= K) for sequence models, using standard spectral bias arguments. This is not a self-definitional reduction because ESD is constructed as an independent complexity measure whose finiteness holds for arbitrary kernels, and the risk bound follows from explicit analysis rather than by tautological construction. The gradient-flow reduction of ESD is a separate dynamical property. No load-bearing self-citations, fitted inputs renamed as predictions, or ansatz smuggling appear; the framework extends independently to linear models and RKHS regression. The chain is verifiable against external benchmarks and contains independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the sequence model being a valid abstraction for the learned-kernel setting and on the ESD being finite and well-defined for arbitrary kernels without additional decay assumptions.

axioms (1)
  • domain assumption Sequence model setting suffices to capture learned-kernel spectral algorithms
    Abstract states results for sequence models whose ESD is at most K
invented entities (1)
  • Effective span dimension (ESD) no independent evidence
    purpose: Alignment-sensitive complexity measure depending on signal, spectrum, and noise
    New quantity introduced to replace traditional effective dimension

pith-pipeline@v0.9.0 · 5682 in / 1255 out tokens · 27174 ms · 2026-05-18T13:51:17.152549+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

  1. [1]

    Agarwal, D

    A. Agarwal, D. Shah, D. Shen, and D. Song. On robustness of principal component regression.Advances in Neural Information Processing Systems, 32, 2019

  2. [2]

    A Convergence Theory for Deep Learning via Over-Parameterization

    Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over- parameterization, June 2019. URLhttp://arxiv.org/abs/1811.03962

  3. [3]

    Arora, S

    S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019. 25

  4. [4]

    J. Ba, M. A. Erdogdu, T. Suzuki, Z. Wang, D. Wu, and G. Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation, May 2022

  5. [5]

    P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, Dec. 2020. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1907378117

  6. [6]

    P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

  7. [7]

    Barzilai and O

    D. Barzilai and O. Shamir. Generalization in kernel regression under realistic assumptions. arXiv preprint arXiv:2312.15995, 2023

  8. [8]

    Bauer, S

    F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007. doi: 10.1016/j.jco.2006.07.001

  9. [9]

    X. Bing, F. Bunea, S. Strimas-Mackey, and M. Wegkamp. Prediction under latent factor regression: Adaptive pcr, interpolating predictors and beyond.Journal of Machine Learning Research, 22(177):1–50, 2021

  10. [10]

    Bordelon, A

    B. Bordelon, A. Atanasov, and C. Pehlevan. How feature learning can improve neural scaling laws. InThe Thirteenth International Conference on Learning Representations,

  11. [11]

    URLhttps://openreview.net/forum?id=dEypApI1MZ

  12. [12]

    L. D. Brown, T. T. Cai, M. G. Low, and C.-H. Zhang. Asymptotic equivalence theory for nonparametric regression with random design.The Annals of Statistics, 30(3):688–707,

  13. [13]

    doi: 10.1214/aos/1028674838

    ISSN 0090-5364. doi: 10.1214/aos/1028674838

  14. [14]

    Caponnetto and E

    A. Caponnetto and E. D. Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007

  15. [15]

    Cortes, M

    C. Cortes, M. Mohri, and A. Rostamizadeh. Algorithms for learning kernels based on centered alignment.The Journal of Machine Learning Research, 13(1):795–828, 2012

  16. [16]

    Cristianini, J

    N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. Advances in neural information processing systems, 14, 2001

  17. [17]

    H. W. Engl, M. Hanke, and A. Neubauer.Regularization of Inverse Problems, volume

  18. [18]

    Springer Science & Business Media, 1996

  19. [19]

    Gatmiry, S

    K. Gatmiry, S. Jegelka, and J. Kelner. Optimization and Adaptive Generalization of Three layer Neural Networks. InInternational Conference on Learning Representations, Oct. 2021. URLhttps://openreview.net/forum?id=dPyRNUlttBv

  20. [20]

    Gedon, A

    D. Gedon, A. H. Ribeiro, and T. B. Schön. No double descent in principal component regression: A high-dimensional analysis. InForty-first International Conference on Machine Learning, 2024

  21. [21]

    L. L. Gerfo, L. Rosasco, F. Odone, E. D. Vito, and A. Verri. Spectral algorithms for supervised learning.Neural Computation, 20(7):1873–1897, 2008. 26

  22. [22]

    Ghorbani, S

    B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. When do neural networks outperform kernel methods?Advances in Neural Information Processing Systems, 33: 14820–14830, 2020

  23. [23]

    Green and E

    A. Green and E. Romanov. The high-dimensional asymptotics of principal component regression.arXiv preprint arXiv:2405.11676, 2024

  24. [24]

    N. T. Huang, D. W. Hogg, and S. Villar. Dimensionality reduction, regularization, and generalization in overparameterized regressions.SIAM Journal on Mathematics of Data Science, 4(1):126–152, 2022

  25. [25]

    Hucker and M

    L. Hucker and M. Wahl. A note on the prediction error of principal component regression in high dimensions.Theory of Probability and Mathematical Statistics, 109:37–53, 2023

  26. [26]

    Jacot, F

    A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and gen- eralization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URLhttps://proceedings.neurips. cc/paper/2018/file/5a4be...

  27. [27]

    I. M. Johnstone. Gaussian estimation: Sequence and wavelet models. 2017

  28. [28]

    S. Karp, E. Winston, Y. Li, and A. Singh. Local signal adaptivity: Provable feature learning in neural networks beyond kernels.Advances in Neural Information Processing Systems, 34:24883–24897, 2021

  29. [29]

    Kornblith, M

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representa- tions revisited. InInternational conference on machine learning, pages 3519–3529. PMLR, 2019

  30. [30]

    Kunin, A

    D. Kunin, A. Raventós, C. Dominé, F. Chen, D. Klindt, A. Saxe, and S. Ganguli. Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning.Advances in Neural Information Processing Systems, 37:81157–81203, 2024

  31. [31]

    Li and Q

    Y. Li and Q. Lin. Improving adaptivity via over-parameterization in sequence models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=UfLH4T676K

  32. [32]

    Li and Q

    Y. Li and Q. Lin. Diagonal over-parameterization in reproducing kernel hilbert spaces as an adaptive feature model: Generalization and adaptivity.arXiv preprint arXiv:2501.08679, 2025

  33. [33]

    Y. Li, W. Gan, Z. Shi, and Q. Lin. Generalization error curves for analytic spectral algorithms under power-law decay.arXiv preprint arXiv:2401.01599, 2024

  34. [34]

    Y. H. Liu, A. Baratin, J. Cornford, S. Mihalas, E. T. SheaBrown, and G. Lajoie. How connectivity structure shapes rich and lazy learning in neural circuits. InThe Twelfth International Conference on Learning Representations, 2024

  35. [35]

    Mathé and S

    P. Mathé and S. V. Pereverzev. Geometry of linear ill-posed problems in variable hilbert scales.Inverse problems, 19(3):789, 2003. 27

  36. [36]

    Radhakrishnan, D

    A. Radhakrishnan, D. Beaglehole, P. Pandit, and M. Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models.Science, 383(6690):1461–1467, 2024

  37. [37]

    Rosasco, E

    L. Rosasco, E. De Vito, and A. Verri. Spectral methods for regularization in learning theory.DISI, Universita degli Studi di Genova, Italy, Technical Report DISI-TR-05-18, 2005

  38. [38]

    Schölkopf and A

    B. Schölkopf and A. J. Smola.Learning with Kernels: Support Vector Machines, Regular- ization, Optimization, and Beyond. MIT Press, 2002

  39. [39]

    Seleznova and G

    M. Seleznova and G. Kutyniok. Analyzing finite neural networks: Can we trust neural tangent kernel theory? InMathematical and Scientific Machine Learning, pages 868–895. PMLR, 2022. URLhttps://proceedings.mlr.press/v145/seleznova22a.html

  40. [40]

    Z. Shi, J. Wei, and Y. Liang. Provable guarantees for neural networks via gradient feature learning.Advances in Neural Information Processing Systems, 36:55848–55918, 2023

  41. [41]

    Steinwart and A

    I. Steinwart and A. Christmann.Support vector machines. Springer Science & Business Media, 2008

  42. [42]

    Steinwart and C

    I. Steinwart and C. Scovel. Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs. 2012. doi: 10.1007/S00365-012-9153-3

  43. [43]

    Tsigler and P

    A. Tsigler and P. L. Bartlett. Benign overfitting in ridge regression.Journal of Machine Learning Research, 24(123):1–76, 2023

  44. [44]

    Wahba.Spline Models for Observational Data, volume 59 ofCBMS-NSF Regional Conference Series in Applied Mathematics

    G. Wahba.Spline Models for Observational Data, volume 59 ofCBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990

  45. [45]

    C. Wang, X. He, Y. Wang, and J. Wang. On the target-kernel alignment: a unified analysis with kernel complexity.Advances in Neural Information Processing Systems, 37: 40434–40485, 2024

  46. [46]

    Wenger, F

    J. Wenger, F. Dangel, and A. Kristiadi. On the disconnect between theory and practice of overparametrized neural networks, Sept. 2023. URLhttp://arxiv.org/abs/2310.00137

  47. [47]

    Woodworth, S

    B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. InProceedings of Thirty Third Conference on Learning Theory, pages 3635–3673. PMLR, July 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html

  48. [48]

    Wu and J

    D. Wu and J. Xu. On the optimal weightedℓ2 regularization in overparameterized linear regression.Advances in Neural Information Processing Systems, 33:10112–10123, 2020

  49. [49]

    Xu and D

    J. Xu and D. J. Hsu. On the number of variables to use in principal component regression. Advances in neural information processing systems, 32, 2019

  50. [50]

    Xu and L

    Y. Xu and L. Ziyin. Three mechanisms of feature learning in a linear network. InThe Thirteenth International Conference on Learning Representations, 2025. 28

  51. [51]

    Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26:289–315, Aug. 2007. doi: 10.1007/s00365-006-0663-2

  52. [52]

    B. Yu. Assouad, fano, and le cam. InFestschrift for Lucien Le Cam: research papers in probability and statistics, pages 423–435. Springer, 1997

  53. [53]

    Zhang, Y

    H. Zhang, Y. Li, and Q. Lin. On the optimality of misspecified spectral algorithms, Mar. 2023

  54. [54]

    Zhang, J

    H. Zhang, J. Lai, Y. Li, Q. Lin, and J. S. Liu. Towards a statistical understanding of neural networks: Beyond the neural tangent kernel theories.arXiv preprint arXiv:2412.18756, 2024

  55. [55]

    T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005. doi: 10.1162/0899766054323008. 29 A Proof A.1 Proofs of Results on ESD of Sequence Models Proof of Theorem 3.2.For anyν >0, define kΛ(ν) = #{j:λ j ≥ν}, which counts how many eigenvalues exceed the thresholdν. The KPCR estimator s...

  56. [56]

    Let Bn = {π1, . . . , πKn}. We define the collection of hypercubes verticesV = {−1, 1}K. For every vertexv∈ V , define a parameter vector θ(v) = (θ(v) j )d j=1 as in Equation (32). There are2K such vectors {θ(v)}. For eachv∈ V , let Pv be the sampling distribution of the sequence model in Equation (4) withθ∗ = θ(v), σ2 = σ2 0/n, and {ξj}j∈[d] being i.i.d....

  57. [57]

    We can then apply the argument in Theorem 8.3 withσ2 replaced by¯σ2/n

    Therefore, σ2 ≤¯σ2/n. We can then apply the argument in Theorem 8.3 withσ2 replaced by¯σ2/n. Lower bound: We establish the lower bound using Assouad’s method. Let m = ⌊c1K⌋. Consider the firstm eigenfunctions {ψπj }j≤m corresponding to the largest eigenvalues {λπj }j≤m. Define the collection of hypercubes verticesV = {−1, 1}m. For every vertexv∈ V, define...

  58. [58]

    It then follows that∥f(v)∥2 ∞ ≤σ 2 0C2 0

    , σ2 0C2 0 C1κ2 . It then follows that∥f(v)∥2 ∞ ≤σ 2 0C2 0. For eachv∈ V , letPv be the sampling distribution of{zi = (xi, yi)}i≤n from the regression model Equation (20) withf ∗ = f (v). Let ρ be the Hamming distance onV. If v and w∈ V differ in exactly one coordinate (i.e.,ρ(v, w) = 1), then •∥f (v) −f (w)∥2 L2(µ) ≥(2γ) 2, and 35 • the Kullback-Leibler ...

  59. [59]

    We choose γ2 =n −1 min σ2 0, ¯σ2 4(1 +C 2

    Furthermore, ifmγ2 ≤ (2n)−1σ2 0Kn, we can the same argument in Theorem A.2 (in particular, using Theorem 4.1 to dervie Equation (33)) to show thatf (v) ∈ FK,k. We choose γ2 =n −1 min σ2 0, ¯σ2 4(1 +C 2

  60. [60]

    , σ2 0C2 0 C1κ2 , which impliesf (v) ∈ FK,k. 36 We then follows the same argument in the proof of lower bound in Theorem 8.5 to obtain sup v∈V Ev bθ−θ (v) 2 ≥m (2γ)2 4 =c σ2 0K n , wherecis a constant that depends onC 0,κ,c 1,C 1. C Proofs for Result on overparameterized gradient flow In this section, we prove Theorem 5.2. The high-level idea is as follow...

  61. [61]

    We have y≥θ(t)≥0∀t≥0

  62. [62]

    4.|θ ∗ −θ(t)|is decreasing provided that|θ ∗ −θ(t)|> κ

    Sincey=θ ∗ +ξand|ξ| ≤κ, we have |θ∗ −θ(t)| ≤ |θ ∗|+κ,∀t≥0. 4.|θ ∗ −θ(t)|is decreasing provided that|θ ∗ −θ(t)|> κ

  63. [63]

    Ify <0, Items 3, 4, and 5 still hold, while Items 1 and 2 can be modified by symmetry

    If|θ ∗ −θ(t 1)| ≤κfor somet 1, we have |θ∗ −θ(t)| ≤κfor allt≥t 1. Ify <0, Items 3, 4, and 5 still hold, while Items 1 and 2 can be modified by symmetry. Proof.Items 1 and 2 are directly implied from Equation (38). Item 3 is implied by Item 2. To prove Item 4, consider Equation (52), from which we have ˙θ=θ 2 a−2 +D 2b−2 +β −2 (y−θ), which implies ˙θ≥0. Si...

  64. [64]

    Furthermore, Equation (65) implies that T1 ≥inf t≥0 : √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)ds≥r ′

    Consequently, we haveb2(t)≤(1 + 1 D)b2 0, and thus |β(t)| ≤ √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)dt,(65) which implies Equation (64) by using the fact that|θ| = |abDβ|. Furthermore, Equation (65) implies that T1 ≥inf t≥0 : √ 2ea0bD 0 Z t 0 (|θ∗|+|ξ|)ds≥r ′ . Then when t > T 1, in the following, supposea0 ≤b 0/D. We haver′ = a0 and R′ = b0/D. Consider t∈ (T1, T2). W...

  65. [65]

    By Grönwall inequality, we have |β(t)| ≤a 0 exp √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds , t∈(T 1, T2)

    Consequently, Equation (38) implies that |β(t)| ≤a 0 + √ 2ebD 0 Z t T1 |β(s)|(|θ ∗|+|ξ|)ds,(66) for anyt∈(T 1, T2). By Grönwall inequality, we have |β(t)| ≤a 0 exp √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds , t∈(T 1, T2). 48 By definition ofT2, we have T2 ≥inf t≥T 1 : √ 2ebD 0 Z t T1 (|θ∗|+|ξ|)ds= ln b0/D a0 .(67) The bound forθ(t)now follows from using the boundsa(t) ...

  66. [66]

    If ˙x(t)≥k x(t), x(0) =x 0, then for allt≥0, it holds that x(t)≥x 0 ekt, and for everyM≥x 0, we have inf{t≥0 :x(t)≥M} ≤ 1 k log M x0

  67. [67]

    double-descent

    If ˙x(t)≤ −k x(t), x(0) =x 0, then for allt≥0, it holds that x(t)≤x 0 e−kt, and for every0< M≤x 0, we have inf{t≥0 :x(t)≤M} ≤ 1 k log x0 M . D Related Work on Principal Component Regression As discussed in Section 3.1, the PC estimator serves as a motivating example for the concepts of ESD and span profile due to its clear illustration of bias-variance tr...

  68. [68]

    misalignment

    extend the analysis to a general covariance matrixΣx and an anisotropic prior satisfying 51 Eβ∗β⊤ ∗ =Σ β. They also derive an exact risk expression and demonstrate how “misalignment” betweenΣ x andΣ β affects risk; here “alignment” refers to concordance between the orderings of their eigenvalues. Both studies assume knowledge of the eigenvectors of the po...

  69. [69]

    Hucker and Wahl [22] derive non-asymptotic error bounds for PCR in kernel regression

    Huang et al.[21] derive non-asymptotic risk bounds for PCR in more general settings by analyzing the alignment between population and empirical principal components. Hucker and Wahl [22] derive non-asymptotic error bounds for PCR in kernel regression. Large Language Models Statement Large language models were used to improve the clarity of the manuscript ...