pith. sign in

arxiv: 2510.04237 · v6 · submitted 2025-10-05 · 💻 cs.LG

Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions

Pith reviewed 2026-05-18 09:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords kernel SGDstochastic gradient descentspherical radial basis functionsgeneralization boundsreproducing kernel Hilbert spaceconvergence rateslarge-scale learningstreaming data
0
0 comments X

The pith

Truncated kernel SGD projects stochastic gradients onto adaptively scaled finite-dimensional spaces using spherical radial basis functions and achieves minimax-optimal convergence rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a kernel stochastic gradient descent method that uses an infinite series expansion of spherical radial basis functions to project gradients into a finite hypothesis space. This projection is scaled adaptively based on the bias-variance trade-off to improve generalization. A new analysis of the spectral properties of the kernel covariance operator unifies the optimization and statistical generalization bounds. The authors prove that the last iterate and the suffix average of the iterates converge at the best possible rates, along with strong convergence in the reproducing kernel Hilbert space. The approach handles common loss functions and offers better efficiency for large datasets and streaming data by using coordinate-wise updates instead of full kernel matrix operations.

Core claim

By projecting the stochastic gradient onto a finite-dimensional space via the series expansion of spherical radial basis functions and using a refined estimate of the eigenvalue decay of the covariance operator, the algorithm attains minimax optimal rates for both the last iterate and the averaged iterates, as well as optimal rates in the RKHS norm, for a wide range of convex loss functions.

What carries the argument

Adaptive truncation of the kernel gradient using the infinite series of spherical radial basis functions, driven by a new spectral estimate of the covariance operator to balance bias and variance.

If this is right

  • The last iterate of the algorithm converges at the minimax-optimal rate.
  • The suffix average also achieves the optimal rate.
  • Optimal strong convergence holds in the reproducing kernel Hilbert space.
  • The method works for least-squares, Huber, and logistic losses.
  • Computational and storage complexity are reduced to linear in the sample size, supporting streaming data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the spectral estimation holds for other kernel families, similar truncation strategies could extend to non-radial kernels.
  • The coordinate-wise update structure suggests natural parallels to coordinate descent methods in high-dimensional statistics.
  • Testing the adaptive scaling rule on datasets with varying noise levels could confirm the bias-variance justification in practice.
  • Strong RKHS convergence may imply faster rates for downstream tasks like model selection or uncertainty quantification.

Load-bearing premise

The estimate of the spectral decay of the kernel-induced covariance operator must be sufficiently precise to correctly guide the adaptive choice of projection dimension and to combine the optimization and generalization error bounds.

What would settle it

A numerical counterexample or real dataset where the observed convergence rate of the last iterate falls short of the claimed minimax rate, or where increasing the projection dimension beyond the bias-variance optimum degrades performance contrary to the analysis.

Figures

Figures reproduced from arXiv: 2510.04237 by Andreas Christmann, Jinhui Bai, Lei Shi.

Figure 1
Figure 1. Figure 1: The left figure illustrates the convergence of the error with respect to the sample [PITH_FULL_IMAGE:figures/full_fig_p018_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The left figure illustrates the convergence of the error with respect to the sample [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The left figure illustrates the convergence of the error with respect to the sample [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The two plots above show the sample-accuracy and time-accuracy, respectively. [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
read the original abstract

In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a truncated kernel SGD algorithm that projects stochastic gradients onto a finite-dimensional space via spherical radial basis function expansions, with the truncation dimension chosen adaptively according to a bias-variance tradeoff. A new estimation of the spectral structure of the kernel-induced covariance operator is introduced to unify optimization and generalization analyses. The authors claim to prove minimax-optimal rates for both the last iterate and suffix average, plus optimal strong convergence in the RKHS, for general losses including least-squares, Huber, and logistic. The approach also achieves linear-time coordinate-wise updates and optimal storage for streaming data, with supporting numerical experiments.

Significance. If the central claims are established, the work would offer a scalable kernel method with rigorous guarantees that bridge optimization and statistical performance across non-quadratic losses. The attempt to derive adaptive truncation from a unified spectral analysis is a positive direction for large-scale kernel learning.

major comments (2)
  1. [Abstract and spectral estimation framework] Abstract and the section introducing the new spectral estimation: the unification of optimization and generalization bounds, as well as the justification for adaptive scaling of the finite-dimensional projection, rests on a new estimation of the spectral structure of the covariance operator. For general losses the stochastic gradient is modulated by the loss derivative evaluated at the current prediction; it is not immediate that the same spectral bounds hold without additional assumptions (e.g., uniform boundedness or Lipschitz conditions on the derivative that interact with kernel eigenvalues). Please provide the precise lemma establishing this extension and state any extra conditions required.
  2. [Main convergence theorems] Theorems on minimax-optimal rates (last iterate, suffix average, and strong RKHS convergence): these rates are asserted to follow from the spectral estimation and the bias-variance choice of truncation dimension. If the estimation is derived from the same data used to tune the adaptive scale, the rates risk becoming fitted quantities rather than independent predictions. Clarify whether the spectral bounds are obtained independently of the tuning data and how they remain valid under the weaker structural assumptions typical of non-quadratic losses.
minor comments (2)
  1. [Algorithm and complexity analysis] The description of coordinate-wise updates and streaming-data handling is clear, but a short complexity table comparing wall-clock time and memory against standard kernel SGD would improve readability.
  2. [Notation and preliminaries] Notation for the adaptive truncation scale and the finite-dimensional projection dimension should be introduced once and used consistently in all statements of the rates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below, indicating where revisions will be made to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and spectral estimation framework] Abstract and the section introducing the new spectral estimation: the unification of optimization and generalization bounds, as well as the justification for adaptive scaling of the finite-dimensional projection, rests on a new estimation of the spectral structure of the covariance operator. For general losses the stochastic gradient is modulated by the loss derivative evaluated at the current prediction; it is not immediate that the same spectral bounds hold without additional assumptions (e.g., uniform boundedness or Lipschitz conditions on the derivative that interact with kernel eigenvalues). Please provide the precise lemma establishing this extension and state any extra conditions required.

    Authors: We thank the referee for this observation. The original manuscript implicitly relied on bounded loss derivatives for the losses considered but did not isolate the extension in a dedicated lemma. In the revision we add Lemma 3.2, which shows that if the loss derivative is uniformly bounded by a constant B (Assumption 2.3), then the operator-norm and trace bounds on the modulated covariance operator are identical to the quadratic case up to the factor B. This covers logistic loss (B=1/4), Huber loss (B equal to the threshold), and least-squares (B=1). The proof uses the fact that the modulation is a scalar multiplier whose absolute value is at most B, preserving the eigenvalue summability and decay rates of the underlying kernel operator. We also state the assumption explicitly and note that it is standard for these losses. revision: yes

  2. Referee: [Main convergence theorems] Theorems on minimax-optimal rates (last iterate, suffix average, and strong RKHS convergence): these rates are asserted to follow from the spectral estimation and the bias-variance choice of truncation dimension. If the estimation is derived from the same data used to tune the adaptive scale, the rates risk becoming fitted quantities rather than independent predictions. Clarify whether the spectral bounds are obtained independently of the tuning data and how they remain valid under the weaker structural assumptions typical of non-quadratic losses.

    Authors: The spectral bounds are derived from the population kernel covariance operator and its known eigenvalue decay (e.g., polynomial or exponential), which are data-independent. The adaptive truncation dimension m is selected via a theoretical bias-variance formula that uses only these population upper bounds, not empirical estimates computed from the same training sample that drives the SGD iterates. High-probability deviation between empirical and population spectra is controlled separately by matrix Bernstein inequalities in the proof, ensuring the final rates remain valid predictions rather than fitted quantities. For non-quadratic losses the same bounded-derivative assumption (now stated as Assumption 2.3) suffices to bound the variance term; no additional strong-convexity or Lipschitz-gradient assumptions are required beyond what is already used for the quadratic case. We have added a clarifying paragraph after Theorem 4.3 and a remark in Section 4.1. revision: yes

Circularity Check

0 steps flagged

No significant circularity; spectral estimation is an independent contribution

full rationale

The paper introduces a 'new estimation of the spectral structure of the kernel-induced covariance operator' as the foundation for its analytical framework that unifies optimization and generalization. This estimation is developed within the work and then applied to establish minimax-optimal rates for the last iterate and suffix average, plus optimal strong RKHS convergence. No quoted equations or self-citations reduce the central claims to tautologies or fitted inputs renamed as predictions. The accommodation of general losses (least-squares, Huber, logistic) is asserted under the paper's assumptions without evidence of self-definitional loops or load-bearing prior self-work that lacks external verification. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete and based on stated high-level assumptions rather than explicit equations.

free parameters (1)
  • adaptive truncation scale
    Chosen according to bias-variance trade-off to determine the finite-dimensional hypothesis space size.
axioms (2)
  • domain assumption The infinite series expansion of spherical radial basis functions permits an accurate finite-dimensional projection of the stochastic gradient.
    Invoked to justify the regularization strategy that improves efficiency and generalization.
  • ad hoc to paper A new estimation of the spectral structure of the kernel-induced covariance operator is sufficiently accurate to unify optimization and generalization bounds.
    Central to the analytical framework that delivers the claimed rates.

pith-pipeline@v0.9.0 · 5728 in / 1414 out tokens · 29561 ms · 2026-05-18T09:50:12.370328+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages

  1. [1]

    Toward large kernel models

    Amirhesam Abedsoltan, Mikhail Belkin, and Parthe Pandit. Toward large kernel models. InProceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 61–78. PMLR, 23–29 Jul 2023

  2. [2]

    Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions.The Annals of Statistics, 47(4):2117–2144, 2019

    Pierre Alquier, Vincent Cottet, and Guillaume Lecu´ e. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions.The Annals of Statistics, 47(4):2117–2144, 2019

  3. [3]

    Faster kernel ridge regression using sketching and preconditioning.SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017

    Haim Avron, Kenneth Clarkson, and David Woodruff. Faster kernel ridge regression using sketching and preconditioning.SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017

  4. [4]

    Non-asymptotic analysis of stochastic approximation algorithms for machine learning

    Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InProceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, page 451–459, Red Hook, NY, USA,

  5. [5]

    Curran Associates Inc

  6. [6]

    Truncated kernel stochastic gradient descent on spheres.Math- ematics of Computation, 2025

    Jinhui Bai and Lei Shi. Truncated kernel stochastic gradient descent on spheres.Math- ematics of Computation, 2025. Published online. 22

  7. [7]

    A general and adaptive robust loss function

    Jonathan Barron. A general and adaptive robust loss function. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  8. [8]

    Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101(473):138–156, 2006

    Peter Bartlett, Michael Jordan, and Jon Mcauliffe. Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101(473):138–156, 2006

  9. [9]

    Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); Philadelphia, PA: Mathematical Optimization Society (MOS), 2017

    Amir Beck.First-order Methods in Optimization. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); Philadelphia, PA: Mathematical Optimization Society (MOS), 2017

  10. [10]

    The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields.Computer Vision and Image Understanding, 63(1):75–104, 1996

    Michael Black and Padmanabhan Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields.Computer Vision and Image Understanding, 63(1):75–104, 1996

  11. [11]

    Optimal rates for regularization of statistical inverse learning problems.Foundations of Computational Mathematics, 18(4):971–1013, 2018

    Gilles Blanchard and Nicole M¨ ucke. Optimal rates for regularization of statistical inverse learning problems.Foundations of Computational Mathematics, 18(4):971–1013, 2018

  12. [12]

    Vladimir Bogachev.Measure Theory. Vol. I and II. Berlin: Springer, 2007

  13. [13]

    Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

    L´ eon Bottou, Frank Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

  14. [14]

    Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

    Andrea Caponnetto and Ernesto Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

  15. [15]

    Consistency and robustness of kernel-based regression in convex risk minimization.Bernoulli, 13(3):799–819, 2007

    Andreas Christmann and Ingo Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization.Bernoulli, 13(3):799–819, 2007

  16. [16]

    Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2013

    Philippe Ciarlet.Linear and Nonlinear Functional Analysis with Applications, volume 130 ofOther Titles in Applied Mathematics. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2013

  17. [17]

    edition, 2001

    Nello Cristianini and John Shawe-Taylor.An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.Cambridge: Cambridge University Press, repr. edition, 2001

  18. [18]

    On the mathematical foundations of learning.Bulletin of the American Mathematical Society

    Felipe Cucker and Steve Smale. On the mathematical foundations of learning.Bulletin of the American Mathematical Society. New Series, 39(1):1–49, 2002

  19. [19]

    Springer Monographs in Mathematics

    Feng Dai and Yuan Xu.Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer Monographs in Mathematics. New York, NY: Springer, 2013

  20. [20]

    Nonparametric stochastic approximation with large step-sizes.The Annals of Statistics, 44(4):1363–1399, 2016

    Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.The Annals of Statistics, 44(4):1363–1399, 2016

  21. [21]

    Error bounds for the asymptotic expansion of the ratio of two gamma functions with complex argument.SIAM Journal on Mathematical Analysis, 23(2):505– 511, 1992

    Chris Frenzen. Error bounds for the asymptotic expansion of the ratio of two gamma functions with complex argument.SIAM Journal on Mathematical Analysis, 23(2):505– 511, 1992

  22. [22]

    The analysis of rates using Poisson regression models.Biometrics, 39:665–674, 1983

    Edward Frome. The analysis of rates using Poisson regression models.Biometrics, 39:665–674, 1983. 23

  23. [23]

    Improved estimates of global ocean circulation, heat transport and mixing from hydrographic data.Nature, 408(6811):453–457, 2000

    Alexandre Ganachaud and Carl Wunsch. Improved estimates of global ocean circulation, heat transport and mixing from hydrographic data.Nature, 408(6811):453–457, 2000

  24. [24]

    Optimality of robust online learning

    Zheng-Chu Guo, Andreas Christmann, and Lei Shi. Optimality of robust online learning. Foundations of Computational Mathematics, 24(5):1455–1483, 2024

  25. [25]

    Fast and strong convergence of online learning algorithms

    Zheng-Chu Guo and Lei Shi. Fast and strong convergence of online learning algorithms. Advances in Computational Mathematics, 45(5-6):2745–2770, 2019

  26. [26]

    John Wiley & Sons, Hoboken, NJ, 1986

    Frank Hampel, Elvezio Ronchetti, Peter Rousseeuw, and Werner Stahel.Robust Statis- tics: The Approach Based on Influence Functions. John Wiley & Sons, Hoboken, NJ, 1986

  27. [27]

    Radial basis function approximation of noisy scattered data on the sphere.Numerische Mathematik, 137(3):579–605, 2017

    Kerstin Hesse, Ian Sloan, and Robert Womersley. Radial basis function approximation of noisy scattered data on the sphere.Numerische Mathematik, 137(3):579–605, 2017

  28. [28]

    Robust regression using iteratively reweighted least- squares.Communications in Statistics - Theory and Methods, 6(9):813–827, 1977

    Paul Holland and Roy Welsch. Robust regression using iteratively reweighted least- squares.Communications in Statistics - Theory and Methods, 6(9):813–827, 1977

  29. [29]

    Cosmic microwave background anisotropies.Annual Review of Astronomy and Astrophysics, 40(1):171–216, 2002

    Wayne Hu and Scott Dodelson. Cosmic microwave background anisotropies.Annual Review of Astronomy and Astrophysics, 40(1):171–216, 2002

  30. [30]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. InAdvances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

  31. [31]

    Making the last iterate of SGD information theoretically optimal.SIAM Journal on Optimization, 31(2):1108– 1130, 2021

    Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of SGD information theoretically optimal.SIAM Journal on Optimization, 31(2):1108– 1130, 2021

  32. [32]

    Cham: Springer, 7th edition edition, 2017

    J¨ urgen Jost.Riemannian Geometry and Geometric Analysis. Cham: Springer, 7th edition edition, 2017

  33. [33]

    Online learning with kernels

    Jyrki Kivinen, Alexander Smola, and Robert Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004

  34. [34]

    Cham: Springer, 2020

    Guanghui Lan.First-order and Stochastic Optimization Methods for Machine Learning. Cham: Springer, 2020

  35. [35]

    New York, NY: Springer, 2nd revised ed edition, 2013

    John Lee.Introduction to Smooth Manifolds, volume 218. New York, NY: Springer, 2nd revised ed edition, 2013

  36. [36]

    Kernel interpolation of high dimen- sional scattered data.SIAM Journal on Numerical Analysis, 62(3):1098–1118, 2024

    Shao-Bo Lin, Xiangyu Chang, and Xingping Sun. Kernel interpolation of high dimen- sional scattered data.SIAM Journal on Numerical Analysis, 62(3):1098–1118, 2024

  37. [37]

    Be- yond least-squares: fast rates for regularized empirical risk minimization through self- concordance

    Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Be- yond least-squares: fast rates for regularized empirical risk minimization through self- concordance. InProceedings of the Thirty-Second Conference on Learning Theory, vol- ume 99 ofProceedings of Machine Learning Research, pages 2294–2340. PMLR, 25–28 Jun 2019

  38. [38]

    Nonparametric regression for spher- ical data.Journal of the American Statistical Association, 109(506):748–763, 2014

    Marco Marzio, Agnese Panzera, and Charles Taylor. Nonparametric regression for spher- ical data.Journal of the American Statistical Association, 109(506):748–763, 2014. 24

  39. [39]

    Nonparametric rotations for sphere- sphere regression.Journal of the American Statistical Association, 114(525):466–476, 2019

    Marco Marzio, Agnese Panzera, and Charles Taylor. Nonparametric rotations for sphere- sphere regression.Journal of the American Statistical Association, 114(525):466–476, 2019

  40. [40]

    Regularization in kernel learning.The Annals of Statistics, 38(1):526–565, 2010

    Shahar Mendelson and Joseph Neeman. Regularization in kernel learning.The Annals of Statistics, 38(1):526–565, 2010

  41. [41]

    Theory of reproducing kernels.Transactions of the American Mathematical Society, 68:337–404, 1950

    Aronszajn Nachman. Theory of reproducing kernels.Transactions of the American Mathematical Society, 68:337–404, 1950

  42. [42]

    A Basic Course., vol- ume 87 ofApplied Optimization

    Yurii Nesterov.Introductory Lectures on Convex Optimization. A Basic Course., vol- ume 87 ofApplied Optimization. Boston: Kluwer Academic Publishers, 2004

  43. [43]

    Cambridge: Cambridge University Press, 2010

    Frank Olver, Daniel Lozier, Ronald Boisvert, and Charles Clark, editors.NIST Handbook of Mathematical Functions. Cambridge: Cambridge University Press, 2010

  44. [44]

    Spatially adaptive smoothing splines.Biometrika, 93(1):113–125, 2006

    Alexandre Pintore, Paul Speckman, and Chris Holmes. Spatially adaptive smoothing splines.Biometrika, 93(1):113–125, 2006

  45. [45]

    Acceleration of stochastic approximation by averag- ing.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

    Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averag- ing.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

  46. [46]

    Early stopping and non-parametric regression: an optimal data-dependent stopping rule.Journal of Machine Learning Re- search, 15:335–366, 2014

    Garvesh Raskutti, Martin Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule.Journal of Machine Learning Re- search, 15:335–366, 2014

  47. [47]

    A stochastic approximation method.Annals of Mathematical Statistics, 22:400–407, 1951

    Herbert Robbins and Sutton Monro. A stochastic approximation method.Annals of Mathematical Statistics, 22:400–407, 1951

  48. [48]

    Spherical regression models using projective linear transformations.Journal of the American Statistical As- sociation, 109(508):1615–1624, 2014

    Michael Rosenthal, Wei Wu, Eric Klassen, and Anuj Srivastava. Spherical regression models using projective linear transformations.Journal of the American Statistical As- sociation, 109(508):1615–1624, 2014

  49. [49]

    On fast leverage score sampling and optimal learning

    Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. InAdvances in Neural Information Pro- cessing Systems, volume 31. Curran Associates, Inc., 2018

  50. [50]

    FALKON: an optimal large scale kernel method

    Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: an optimal large scale kernel method. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  51. [51]

    Sparse Cholesky factorization by Kullback-Leibler minimization.SIAM Journal on Scientific Computing, 43(3):a2019– a2046, 2021

    Florian Sch¨ afer, Matthias Katzfuss, and Houman Owhadi. Sparse Cholesky factorization by Kullback-Leibler minimization.SIAM Journal on Scientific Computing, 43(3):a2019– a2046, 2021

  52. [52]

    The MIT Press, 12 2001

    Bernhard Sch¨ olkopf and Alexander Smola.Learning with Kernels: Support Vector Ma- chines, Regularization, Optimization, and Beyond. The MIT Press, 12 2001

  53. [53]

    Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

    Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InProceedings of the 30th Interna- tional Conference on Machine Learning, volume 28 ofProceedings of Machine Learning Research, pages 71–79, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. 25

  54. [54]

    Online learning algorithms.Foundations of Computational Mathematics, 6(2):145–170, 2006

    Steve Smale and Yuan Yao. Online learning algorithms.Foundations of Computational Mathematics, 6(2):145–170, 2006

  55. [55]

    Learning theory estimates via integral operators and their approximations.Constructive Approximation, 26(2):153–172, 2007

    Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations.Constructive Approximation, 26(2):153–172, 2007

  56. [56]

    New York, NY: Springer, 2008

    Ingo Steinwart and Andreas Christmann.Support Vector Machines. New York, NY: Springer, 2008

  57. [57]

    Online learning as stochastic approximation of regulariza- tion paths: optimality and almost-sure convergence.IEEE Transactions on Information Theory, 60(9):5716–5735, 2014

    Pierre Tarr` es and Yuan Yao. Online learning as stochastic approximation of regulariza- tion paths: optimality and almost-sure convergence.IEEE Transactions on Information Theory, 60(9):5716–5735, 2014

  58. [58]

    The nystr¨ om method for convex loss functions.Journal of Machine Learning Research, 25(360):1–60, 2024

    Andrea Vecchia, Ernesto Vito, Jaouad Mourtada, and Lorenzo Rosasco. The nystr¨ om method for convex loss functions.Journal of Machine Learning Research, 25(360):1–60, 2024

  59. [59]

    Cam- bridge Series in Statistical and Probabilistic Mathematics

    Martin Wainwright.High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam- bridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019

  60. [60]

    Marginal longitudinal nonparametric regression: locality and efficiency of spline and kernel methods.Journal of the American Statistical Association, 97(458):482–493, 2002

    Alan Welsh, Xihong Lin, and Raymond Carroll. Marginal longitudinal nonparametric regression: locality and efficiency of spline and kernel methods.Journal of the American Statistical Association, 97(458):482–493, 2002

  61. [61]

    Benefits of early stopping in gradient descent for overparameterized logistic regression

    Jingfeng Wu, Peter Bartlett, Matus Telgarsky, and Bin Yu. Benefits of early stopping in gradient descent for overparameterized logistic regression. InForty-second International Conference on Machine Learning, 2025

  62. [62]

    Robust truncated hinge loss support vector machines.Jour- nal of the American Statistical Association, 102(479):974–983, 2007

    Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines.Jour- nal of the American Statistical Association, 102(479):974–983, 2007

  63. [63]

    Randomized sketches for kernels: fast and optimal nonparametric regression.The Annals of Statistics, 45(3):991–1023, 2017

    Yun Yang, Mert Pilanci, and Martin Wainwright. Randomized sketches for kernels: fast and optimal nonparametric regression.The Annals of Statistics, 45(3):991–1023, 2017

  64. [64]

    Online gradient descent learning algorithms

    Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008

  65. [65]

    A sieve stochastic gradient descent estimator for online nonparametric regression in Sobolev ellipsoids.The Annals of Statistics, 50(5):2848–2871, 2022

    Tianyu Zhang and Noah Simon. A sieve stochastic gradient descent estimator for online nonparametric regression in Sobolev ellipsoids.The Annals of Statistics, 50(5):2848–2871, 2022

  66. [66]

    Statistical behavior and consistency of classification methods based on convex risk minimization.The Annals of Statistics, 32(1):56–85, 2004

    Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization.The Annals of Statistics, 32(1):56–85, 2004

  67. [67]

    Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates.Journal of Machine Learning Research, 16:3299–3340, 2015

    Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates.Journal of Machine Learning Research, 16:3299–3340, 2015

  68. [68]

    8eC γ0 +γ 0M 2 1 # (n+ 1) − 2r 2r+1 log(n+ 2).(A.41) 49 Combining the definition ofS k with (A.41) yields kSk−1 = (k+ 1)S k −E h E ˆfn−k i =kS k + Sk −E h E ˆfn−k i ≤kS k +

    Libin Zhu, Chaoyue Liu, and Mikhail Belkin. Transition to linearity of general neural networks with directed acyclic graph architecture. InAdvances in Neural Information Processing Systems, volume 35, pages 5363–5375. Curran Associates, Inc., 2022. 26 A Appendix A.1 Preliminaries In this section, we present the explicit expressions of spherical harmonics ...