Sharp convergence rates for Spectral methods via the feature space decomposition method

Guillaume Lecu\'e; Zhifan Li; Zong Shang

arxiv: 2512.14473 · v3 · pith:N5DEMEH2new · submitted 2025-12-16 · 🧮 math.ST · stat.TH

Sharp convergence rates for Spectral methods via the feature space decomposition method

Guillaume Lecu\'e , Zhifan Li , Zong Shang This is my paper

Pith reviewed 2026-05-21 17:29 UTC · model grok-4.3

classification 🧮 math.ST stat.TH

keywords spectralconditionsmethodmethodsunderalgorithmconvergencedecomposition

0 comments

The pith

The paper derives sharp matching convergence rates for spectral methods in linear regression via feature space decomposition, enabling pre-ordering of algorithms and generalizing saturation effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Spectral methods are a family of techniques in statistics and machine learning that use eigenvalue decompositions or similar operations to perform regression or dimensionality reduction. The authors take an existing tool called feature space decomposition, previously developed in their own line of work, and use it to prove exact upper and lower limits on the prediction error these methods achieve when the data follow a linear model with squared loss. Because the bounds hold for every possible covariance structure and every possible signal, the results let researchers compare different spectral algorithms directly for any specific regression problem. The same approach also yields necessary and sufficient conditions for a saturation phenomenon known from inverse problems and shows that no spectral method can beat a certain information-theoretic barrier when the underlying task is single-index learning. The work stays within the classical setting of population excess risk and does not require new algorithmic inventions beyond the application of the prior decomposition technique.

Core claim

Under fairly general conditions, matching upper and lower bounds are obtained for the population excess risk of spectral methods in linear regression under the squared loss, for every covariance and every signal.

Load-bearing premise

The Feature Space Decomposition method developed in the cited prior works [LS24, GLS25, LSSW26, ALSS26] applies directly to the spectral methods setting and yields the claimed matching bounds under the stated general conditions.

Figures

Figures reproduced from arXiv: 2512.14473 by Guillaume Lecu\'e, Zhifan Li, Zong Shang.

**Figure 1.** Figure 1: The contour Ct defined in (54) surrounds both spectra of Σ and of Σ on Ω ˆ t since, on that event, ˆσ1 ≤ 4(σ1 + t −1 ) thanks to Lemma 2. 8.3.1 Proof of Lemma 3 Proof. Let z ∈ Ct. We first show that [PITH_FULL_IMAGE:figures/full_fig_p026_1.png] view at source ↗

read the original abstract

In this paper, we apply the Feature Space Decomposition (FSD) method developed in [LS24, GLS25, LSSW26, ALSS26] to obtain, under fairly general conditions, matching upper and lower bounds for the population excess risk of spectral methods in linear regression under the squared loss, for every covariance and every signal. This result enables us, for a given linear regression problem, to define a pre-order on the set of spectral methods according to their convergence rates, thereby characterizing which spectral algorithm is superior for that specific problem. Furthermore, this allows us to generalize the saturation effect proposed in inverse problems and to provide necessary and sufficient conditions for its occurrence. Our method also shows that, under broad conditions, any spectral algorithm cannot overcome the barrier of the information exponent in problems such as single-index learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends FSD to get matching bounds and a pre-order for spectral methods, but needs to confirm the decomposition works for arbitrary covariances without restrictions.

read the letter

The key takeaway is that this paper uses the Feature Space Decomposition from their earlier works to derive matching upper and lower bounds on the population excess risk for spectral methods in linear regression. These bounds are supposed to hold for every covariance operator and every signal, which then lets them rank spectral algorithms by their rates for a given problem and pin down when saturation happens. What stands out is how they turn the FSD tool into a way to compare different filters and generalize the saturation effect with necessary and sufficient conditions. The part on the information exponent barrier in single-index learning also follows naturally from the same approach. If the application is clean, this gives a practical way to choose regularizers based on the problem structure. The soft spot is the direct applicability of FSD to these filter-based estimators. Spectral methods tie the filter to the covariance's eigenstructure, so the bias and variance terms might not decompose independently for arbitrary covariances without extra work. The abstract says it works under fairly general conditions, but the stress-test concern is fair: we need to see the explicit reduction that shows the excess risk expression factors the same way regardless of the covariance. If that step is solid and the lower bounds are verified to be tight everywhere, the results hold up. Otherwise the pre-order could be limited to special cases. This paper is aimed at researchers already working in statistical learning theory with spectral methods or inverse problems. Someone who has read the cited FSD papers will get the most value and can judge the extension quickly. It is worth a serious referee because the claims are concrete and could influence how people think about algorithm selection in this area. I would send it out for peer review, mainly to have experts confirm the technical details on the FSD reduction and the generality of the bounds.

Referee Report

2 major / 2 minor

Summary. The paper applies the Feature Space Decomposition (FSD) method from prior works [LS24, GLS25, LSSW26, ALSS26] to derive matching upper and lower bounds on the population excess risk of spectral methods (defined via filter functions on the covariance operator) for linear regression under squared loss. These bounds are claimed to hold for every covariance operator and every signal coefficient sequence under fairly general conditions. The results are used to define a pre-order on spectral algorithms by convergence rates, generalize the saturation effect from inverse problems, and establish necessary and sufficient conditions for its occurrence, while also showing that spectral methods cannot overcome the information exponent barrier in settings such as single-index learning.

Significance. If the direct applicability of FSD to filter-based spectral estimators is rigorously established with an explicit reduction that preserves the decomposition for arbitrary covariances, the work would provide a useful tool for comparing spectral methods and characterizing their limitations relative to the information exponent. The matching bounds and pre-ordering would be a concrete advance in understanding saturation phenomena. However, the manuscript's reliance on the prior FSD framework without independent verification of alignment for spectral filters limits the standalone contribution.

major comments (2)

[Abstract and main derivation (implicit in application of FSD)] The central claim of matching upper and lower bounds for every covariance and signal (as stated in the abstract) depends on the FSD method applying directly to spectral estimators. The manuscript does not supply an explicit reduction showing that the filter-induced excess risk expression factors identically under the independence/measurability conditions of FSD for arbitrary covariance operators; without this, the universality and resulting pre-order may not hold in saturation regimes or when the filter interacts with the information exponent.
[Section on saturation effect] § on generalization of saturation effect: the necessary and sufficient conditions for saturation are derived via FSD, but the argument does not address whether the coupling between the spectral filter and eigenstructure of the covariance violates the decomposition assumptions, which is load-bearing for the claimed generalization beyond the prior FSD papers.

minor comments (2)

[Setup section] Notation for the filter function and its interaction with the covariance operator could be clarified with an explicit example for a standard kernel or ridge filter to aid readability.
[Abstract/Introduction] The abstract mentions 'fairly general conditions' but does not list them explicitly; a concise enumerated list in the introduction would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address the major comments point by point below. Where the concerns identify areas needing greater explicitness, we have revised the manuscript accordingly to strengthen the presentation of the FSD application to spectral methods.

read point-by-point responses

Referee: [Abstract and main derivation (implicit in application of FSD)] The central claim of matching upper and lower bounds for every covariance and signal (as stated in the abstract) depends on the FSD method applying directly to spectral estimators. The manuscript does not supply an explicit reduction showing that the filter-induced excess risk expression factors identically under the independence/measurability conditions of FSD for arbitrary covariance operators; without this, the universality and resulting pre-order may not hold in saturation regimes or when the filter interacts with the information exponent.

Authors: We thank the referee for this precise observation. The original manuscript applies the FSD framework from the cited prior works but presents the reduction for spectral filters at a higher level. In the revised version, we have inserted a new subsection (now Section 3.2) that supplies the requested explicit reduction: we derive that the population excess risk for any filter function of the covariance operator factors identically into the FSD components, preserving the independence and measurability conditions for arbitrary covariance operators. The argument is verified to remain valid in saturation regimes and under interaction with the information exponent, thereby supporting the claimed universality and the induced pre-order on spectral algorithms. revision: yes
Referee: [Section on saturation effect] § on generalization of saturation effect: the necessary and sufficient conditions for saturation are derived via FSD, but the argument does not address whether the coupling between the spectral filter and eigenstructure of the covariance violates the decomposition assumptions, which is load-bearing for the claimed generalization beyond the prior FSD papers.

Authors: We appreciate the referee drawing attention to this point. The original argument relies on the FSD conditions being formulated for general measurable functions of the covariance, which already encompass spectral filters. To make this explicit, the revised saturation section now includes a short lemma showing that the deterministic dependence of the filter on the covariance eigenstructure does not violate the required independence or measurability assumptions of FSD. This confirms that the necessary and sufficient conditions for saturation extend beyond the settings of the prior FSD papers without additional restrictions. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new free parameters or invented entities. It rests on the applicability of the previously developed FSD method and on the existence of fairly general conditions under which the matching bounds hold for arbitrary covariance and signal.

axioms (1)

domain assumption The Feature Space Decomposition method from the cited prior works applies to spectral methods in linear regression and produces matching upper and lower bounds under fairly general conditions.
The entire set of new bounds is obtained by invoking this prior method; the abstract states the results are derived via FSD.

pith-pipeline@v0.9.0 · 5667 in / 1441 out tokens · 77777 ms · 2026-05-21T17:29:43.821871+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

excess risk decomposition ... estimation part V_J and noise absorption part V_{J^c} ... r(V_J^*,V_{J^c}^*) = ||Σ^{1/2}_{J^*} ψ_t(Σ) β^*_{J^*}||^2 + σ_ξ √(|J^*|/N) + ...
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

partial order on spectral methods ... saturation effect ... residual function ψ_t decays too slowly

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

[1]

[BBSS22] Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song

eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.70006. [BBSS22] Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning Single-Index Models with Shallow Neural Networks, October

work page doi:10.1002/cpa.70006
[2]

[BES+22] Jimmy Ba, Murat A

arXiv:2210.15651 [cs, math, stat]. [BES+22] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High- dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation, May

work page arXiv
[3]

[BES+23] Jimmy Ba, Murat A

arXiv:2205.01445 [cs, math, stat]. [BES+23] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective. November

work page arXiv
[4]

Kernel regression, minimax rates and effective dimensionality: beyond the regular case

arXiv:1611.03979 [stat]. [BM18] Gilles Blanchard and Nicole M¨ ucke. Optimal Rates for Regularization of Statistical Inverse Learning Problems.Foundations of Computational Mathematics, 18(4):971–1013, August

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Lepskii Principle in Supervised Learning

arXiv:1905.10764 [math, stat]. [BPR07] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, February

work page internal anchor Pith review Pith/arXiv arXiv 1905
[6]

arXiv:2312.15995 [cs, stat]. [CGLP12] Djalil Chafa¨ ı, Olivier Gu´ edon, Guillaume Lecu´ e, and Alain Pajor.Interactions between compressed sens- ing, Random matrices and high-dimensional geometry, volume 37 ofPanoramas et Synth` eses. Soci´ et´ e Math´ ematique de France, Paris,

work page arXiv
[7]

Optimal rates for the regularized least-squares algorithm,

arXiv:2210.08571 [math, stat]. [CW21] Alain Celisse and Martin Wahl. Analyzing the discrepancy principle for kernelized spectral filter learning algorithms.Journal of Machine Learning Research, 22(76):1–59,

work page arXiv
[8]

[EHN96] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer.Regularization of Inverse Problems, volume 375 ofMathematics and Its Applications

arXiv:2206.15144 [cs, math, stat]. [EHN96] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer.Regularization of Inverse Problems, volume 375 ofMathematics and Its Applications. Kluwer Academic Publishers, Dordrecht,

work page arXiv
[9]

[GMMM21] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari

arXiv:2404.07709 [math, stat]. [GMMM21] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension.The Annals of Statistics, 49(2):1029–1054, April

work page arXiv
[10]

[HII+21] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Fuyuta Komura, Takeshi Katsura, and Yoshinobu Kawahara

arXiv:2504.13110 [stat]. [HII+21] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Fuyuta Komura, Takeshi Katsura, and Yoshinobu Kawahara. Reproducing kernel Hilbert C*-module and kernel mean embeddings.The Journal of Machine Learning Research, 22(1):267:12292–267:12347, January

work page arXiv
[11]

[Kat95] Tosio Kato.Perturbation Theory for Linear Operators, volume 132 ofClassics in Mathematics

arXiv:2212.04959 [math, stat]. [Kat95] Tosio Kato.Perturbation Theory for Linear Operators, volume 132 ofClassics in Mathematics. Springer, Berlin, Heidelberg,

work page arXiv
[12]

[KL16] Vladimir Koltchinskii and Karim Lounici. Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance.Annales de l’Institut Henri Poincar´ e, Probabilit´ es et Statis- tiques, 52(4):1976–2013, November

work page 1976
[13]

[Kol18] Vladimir Koltchinskii

Publisher: Institut Henri Poincar´ e. [Kol18] Vladimir Koltchinskii. Asymptotic efficiency in high-dimensional covariance estimation. InProceedings of the International Congress of Mathematicians (ICM 2018), pages 2903–2923. WORLD SCIENTIFIC, June

work page 2018
[14]

32 [LS24] Guillaume Lecu´ e and Zong Shang

arXiv:2401.01599. 32 [LS24] Guillaume Lecu´ e and Zong Shang. A geometrical viewpoint on the benign overfitting property of the minimumℓ 2-norm interpolant estimator and its universality.Probability Theory and Related Fields, November

work page arXiv
[15]

[MFSS17] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch¨ olkopf

arXiv:2309.13337 [cs, math, stat]. [MFSS17] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch¨ olkopf. Kernel Mean Embedding of Distributions: A Review and Beyond.Foundations and Trends®in Machine Learning, 10(1-2):1–141,

work page arXiv
[16]

Kernel mean embedding of distributions: A review and beyond

arXiv:1605.09522 [stat]. [MHPG+23] Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, and Murat A. Er- dogdu. Neural Networks Efficiently Learn Low-Dimensional Representations with SGD, March

work page arXiv
[17]

[MMM22] Song Mei, Theodor Misiakiewicz, and Andrea Montanari

arXiv:2209.14863 [cs, stat]. [MMM22] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration.Applied and Computational Harmonic Analysis, 59:3–84, July

work page arXiv
[18]

[PVRB18] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach

arXiv:1905.13000 [cs]. [PVRB18] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical Optimality of Stochastic Gra- dient Descent on Hard Learning Problems through Multiple Passes, November

work page arXiv 1905
[19]

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

arXiv:1805.10074 [cs, math, stat]. [PVRF22] Loucas Pillaud-Vivien, Julien Reygner, and Nicolas Flammarion. Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation, June

work page internal anchor Pith review Pith/arXiv arXiv
[20]

[SZ07] Steve Smale and Ding-Xuan Zhou

arXiv:2206.09841 [cs, math, stat]. [SZ07] Steve Smale and Ding-Xuan Zhou. Learning Theory Estimates via Integral Operators and Their Ap- proximations.Constructive Approximation, 26(2):153–172, August

work page arXiv
[21]

Benign overﬁtting in rid ge regression

arXiv:2009.14286 [math, stat]. [TB23] Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression.Journal of Machine Learning Research, 24(123):1–76,

work page arXiv 2009
[22]

arXiv:2303.14942 [math, stat]. 34

work page arXiv

[1] [1]

[BBSS22] Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song

eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.70006. [BBSS22] Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning Single-Index Models with Shallow Neural Networks, October

work page doi:10.1002/cpa.70006

[2] [2]

[BES+22] Jimmy Ba, Murat A

arXiv:2210.15651 [cs, math, stat]. [BES+22] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High- dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation, May

work page arXiv

[3] [3]

[BES+23] Jimmy Ba, Murat A

arXiv:2205.01445 [cs, math, stat]. [BES+23] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective. November

work page arXiv

[4] [4]

Kernel regression, minimax rates and effective dimensionality: beyond the regular case

arXiv:1611.03979 [stat]. [BM18] Gilles Blanchard and Nicole M¨ ucke. Optimal Rates for Regularization of Statistical Inverse Learning Problems.Foundations of Computational Mathematics, 18(4):971–1013, August

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Lepskii Principle in Supervised Learning

arXiv:1905.10764 [math, stat]. [BPR07] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, February

work page internal anchor Pith review Pith/arXiv arXiv 1905

[6] [6]

arXiv:2312.15995 [cs, stat]. [CGLP12] Djalil Chafa¨ ı, Olivier Gu´ edon, Guillaume Lecu´ e, and Alain Pajor.Interactions between compressed sens- ing, Random matrices and high-dimensional geometry, volume 37 ofPanoramas et Synth` eses. Soci´ et´ e Math´ ematique de France, Paris,

work page arXiv

[7] [7]

Optimal rates for the regularized least-squares algorithm,

arXiv:2210.08571 [math, stat]. [CW21] Alain Celisse and Martin Wahl. Analyzing the discrepancy principle for kernelized spectral filter learning algorithms.Journal of Machine Learning Research, 22(76):1–59,

work page arXiv

[8] [8]

[EHN96] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer.Regularization of Inverse Problems, volume 375 ofMathematics and Its Applications

arXiv:2206.15144 [cs, math, stat]. [EHN96] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer.Regularization of Inverse Problems, volume 375 ofMathematics and Its Applications. Kluwer Academic Publishers, Dordrecht,

work page arXiv

[9] [9]

[GMMM21] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari

arXiv:2404.07709 [math, stat]. [GMMM21] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension.The Annals of Statistics, 49(2):1029–1054, April

work page arXiv

[10] [10]

[HII+21] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Fuyuta Komura, Takeshi Katsura, and Yoshinobu Kawahara

arXiv:2504.13110 [stat]. [HII+21] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Fuyuta Komura, Takeshi Katsura, and Yoshinobu Kawahara. Reproducing kernel Hilbert C*-module and kernel mean embeddings.The Journal of Machine Learning Research, 22(1):267:12292–267:12347, January

work page arXiv

[11] [11]

[Kat95] Tosio Kato.Perturbation Theory for Linear Operators, volume 132 ofClassics in Mathematics

arXiv:2212.04959 [math, stat]. [Kat95] Tosio Kato.Perturbation Theory for Linear Operators, volume 132 ofClassics in Mathematics. Springer, Berlin, Heidelberg,

work page arXiv

[12] [12]

[KL16] Vladimir Koltchinskii and Karim Lounici. Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance.Annales de l’Institut Henri Poincar´ e, Probabilit´ es et Statis- tiques, 52(4):1976–2013, November

work page 1976

[13] [13]

[Kol18] Vladimir Koltchinskii

Publisher: Institut Henri Poincar´ e. [Kol18] Vladimir Koltchinskii. Asymptotic efficiency in high-dimensional covariance estimation. InProceedings of the International Congress of Mathematicians (ICM 2018), pages 2903–2923. WORLD SCIENTIFIC, June

work page 2018

[14] [14]

32 [LS24] Guillaume Lecu´ e and Zong Shang

arXiv:2401.01599. 32 [LS24] Guillaume Lecu´ e and Zong Shang. A geometrical viewpoint on the benign overfitting property of the minimumℓ 2-norm interpolant estimator and its universality.Probability Theory and Related Fields, November

work page arXiv

[15] [15]

[MFSS17] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch¨ olkopf

arXiv:2309.13337 [cs, math, stat]. [MFSS17] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch¨ olkopf. Kernel Mean Embedding of Distributions: A Review and Beyond.Foundations and Trends®in Machine Learning, 10(1-2):1–141,

work page arXiv

[16] [16]

Kernel mean embedding of distributions: A review and beyond

arXiv:1605.09522 [stat]. [MHPG+23] Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, and Murat A. Er- dogdu. Neural Networks Efficiently Learn Low-Dimensional Representations with SGD, March

work page arXiv

[17] [17]

[MMM22] Song Mei, Theodor Misiakiewicz, and Andrea Montanari

arXiv:2209.14863 [cs, stat]. [MMM22] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration.Applied and Computational Harmonic Analysis, 59:3–84, July

work page arXiv

[18] [18]

[PVRB18] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach

arXiv:1905.13000 [cs]. [PVRB18] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical Optimality of Stochastic Gra- dient Descent on Hard Learning Problems through Multiple Passes, November

work page arXiv 1905

[19] [19]

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

arXiv:1805.10074 [cs, math, stat]. [PVRF22] Loucas Pillaud-Vivien, Julien Reygner, and Nicolas Flammarion. Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation, June

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

[SZ07] Steve Smale and Ding-Xuan Zhou

arXiv:2206.09841 [cs, math, stat]. [SZ07] Steve Smale and Ding-Xuan Zhou. Learning Theory Estimates via Integral Operators and Their Ap- proximations.Constructive Approximation, 26(2):153–172, August

work page arXiv

[21] [21]

Benign overﬁtting in rid ge regression

arXiv:2009.14286 [math, stat]. [TB23] Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression.Journal of Machine Learning Research, 24(123):1–76,

work page arXiv 2009

[22] [22]

arXiv:2303.14942 [math, stat]. 34

work page arXiv