Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions

Andreas Christmann; Jinhui Bai; Lei Shi

arxiv: 2510.04237 · v6 · submitted 2025-10-05 · 💻 cs.LG

Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions

Jinhui Bai , Andreas Christmann , Lei Shi This is my paper

Pith reviewed 2026-05-18 09:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords kernel SGDstochastic gradient descentspherical radial basis functionsgeneralization boundsreproducing kernel Hilbert spaceconvergence rateslarge-scale learningstreaming data

0 comments

The pith

Truncated kernel SGD projects stochastic gradients onto adaptively scaled finite-dimensional spaces using spherical radial basis functions and achieves minimax-optimal convergence rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a kernel stochastic gradient descent method that uses an infinite series expansion of spherical radial basis functions to project gradients into a finite hypothesis space. This projection is scaled adaptively based on the bias-variance trade-off to improve generalization. A new analysis of the spectral properties of the kernel covariance operator unifies the optimization and statistical generalization bounds. The authors prove that the last iterate and the suffix average of the iterates converge at the best possible rates, along with strong convergence in the reproducing kernel Hilbert space. The approach handles common loss functions and offers better efficiency for large datasets and streaming data by using coordinate-wise updates instead of full kernel matrix operations.

Core claim

By projecting the stochastic gradient onto a finite-dimensional space via the series expansion of spherical radial basis functions and using a refined estimate of the eigenvalue decay of the covariance operator, the algorithm attains minimax optimal rates for both the last iterate and the averaged iterates, as well as optimal rates in the RKHS norm, for a wide range of convex loss functions.

What carries the argument

Adaptive truncation of the kernel gradient using the infinite series of spherical radial basis functions, driven by a new spectral estimate of the covariance operator to balance bias and variance.

If this is right

The last iterate of the algorithm converges at the minimax-optimal rate.
The suffix average also achieves the optimal rate.
Optimal strong convergence holds in the reproducing kernel Hilbert space.
The method works for least-squares, Huber, and logistic losses.
Computational and storage complexity are reduced to linear in the sample size, supporting streaming data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the spectral estimation holds for other kernel families, similar truncation strategies could extend to non-radial kernels.
The coordinate-wise update structure suggests natural parallels to coordinate descent methods in high-dimensional statistics.
Testing the adaptive scaling rule on datasets with varying noise levels could confirm the bias-variance justification in practice.
Strong RKHS convergence may imply faster rates for downstream tasks like model selection or uncertainty quantification.

Load-bearing premise

The estimate of the spectral decay of the kernel-induced covariance operator must be sufficiently precise to correctly guide the adaptive choice of projection dimension and to combine the optimization and generalization error bounds.

What would settle it

A numerical counterexample or real dataset where the observed convergence rate of the last iterate falls short of the claimed minimax rate, or where increasing the projection dimension beyond the bias-variance optimum degrades performance contrary to the analysis.

Figures

Figures reproduced from arXiv: 2510.04237 by Andreas Christmann, Jinhui Bai, Lei Shi.

**Figure 2.** Figure 2: The left figure illustrates the convergence of the error with respect to the sample [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗

**Figure 3.** Figure 3: The left figure illustrates the convergence of the error with respect to the sample [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗

**Figure 4.** Figure 4: The two plots above show the sample-accuracy and time-accuracy, respectively. [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

read the original abstract

In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This truncated kernel SGD with spherical RBFs and spectral analysis for general losses is a solid technical step but hinges on whether the covariance estimates extend beyond least squares.

read the letter

The core takeaway is that this work develops a truncated kernel SGD algorithm using spherical radial basis function expansions to handle general losses efficiently in large-scale and streaming settings, with claims of minimax optimal rates and strong convergence. The new elements include the truncation strategy that projects gradients into a finite-dimensional space scaled adaptively by bias-variance considerations, combined with coordinate-wise updates drawn from linear SGD to cut down on storage and pairwise kernel computations. The unified spectral analysis based on a fresh estimate of the covariance operator's structure is what allows them to tie optimization and generalization together for losses like Huber and logistic in addition to least squares. That part looks like a genuine step forward if the estimates hold. Where it might be soft is in the extension of the spectral estimation to general losses. For squared loss the covariance operator directly controls the excess risk, but with other losses the stochastic gradient is scaled by the loss derivative evaluated at the prediction. The paper asserts the framework covers these cases, yet if the new spectral bounds require extra conditions on the derivative or Lipschitz properties that aren't standard for all losses, the claimed unification and the justification for the adaptive truncation could weaken. The abstract states the results clearly, but the derivations would need to be checked for any gaps there. This paper targets researchers focused on scalable nonparametric methods and online kernel learning. It brings together theoretical rates with practical efficiency improvements, so it merits a serious referee even if revisions are needed on the general loss proofs. I would send it for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a truncated kernel SGD algorithm that projects stochastic gradients onto a finite-dimensional space via spherical radial basis function expansions, with the truncation dimension chosen adaptively according to a bias-variance tradeoff. A new estimation of the spectral structure of the kernel-induced covariance operator is introduced to unify optimization and generalization analyses. The authors claim to prove minimax-optimal rates for both the last iterate and suffix average, plus optimal strong convergence in the RKHS, for general losses including least-squares, Huber, and logistic. The approach also achieves linear-time coordinate-wise updates and optimal storage for streaming data, with supporting numerical experiments.

Significance. If the central claims are established, the work would offer a scalable kernel method with rigorous guarantees that bridge optimization and statistical performance across non-quadratic losses. The attempt to derive adaptive truncation from a unified spectral analysis is a positive direction for large-scale kernel learning.

major comments (2)

[Abstract and spectral estimation framework] Abstract and the section introducing the new spectral estimation: the unification of optimization and generalization bounds, as well as the justification for adaptive scaling of the finite-dimensional projection, rests on a new estimation of the spectral structure of the covariance operator. For general losses the stochastic gradient is modulated by the loss derivative evaluated at the current prediction; it is not immediate that the same spectral bounds hold without additional assumptions (e.g., uniform boundedness or Lipschitz conditions on the derivative that interact with kernel eigenvalues). Please provide the precise lemma establishing this extension and state any extra conditions required.
[Main convergence theorems] Theorems on minimax-optimal rates (last iterate, suffix average, and strong RKHS convergence): these rates are asserted to follow from the spectral estimation and the bias-variance choice of truncation dimension. If the estimation is derived from the same data used to tune the adaptive scale, the rates risk becoming fitted quantities rather than independent predictions. Clarify whether the spectral bounds are obtained independently of the tuning data and how they remain valid under the weaker structural assumptions typical of non-quadratic losses.

minor comments (2)

[Algorithm and complexity analysis] The description of coordinate-wise updates and streaming-data handling is clear, but a short complexity table comparing wall-clock time and memory against standard kernel SGD would improve readability.
[Notation and preliminaries] Notation for the adaptive truncation scale and the finite-dimensional projection dimension should be introduced once and used consistently in all statements of the rates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below, indicating where revisions will be made to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract and spectral estimation framework] Abstract and the section introducing the new spectral estimation: the unification of optimization and generalization bounds, as well as the justification for adaptive scaling of the finite-dimensional projection, rests on a new estimation of the spectral structure of the covariance operator. For general losses the stochastic gradient is modulated by the loss derivative evaluated at the current prediction; it is not immediate that the same spectral bounds hold without additional assumptions (e.g., uniform boundedness or Lipschitz conditions on the derivative that interact with kernel eigenvalues). Please provide the precise lemma establishing this extension and state any extra conditions required.

Authors: We thank the referee for this observation. The original manuscript implicitly relied on bounded loss derivatives for the losses considered but did not isolate the extension in a dedicated lemma. In the revision we add Lemma 3.2, which shows that if the loss derivative is uniformly bounded by a constant B (Assumption 2.3), then the operator-norm and trace bounds on the modulated covariance operator are identical to the quadratic case up to the factor B. This covers logistic loss (B=1/4), Huber loss (B equal to the threshold), and least-squares (B=1). The proof uses the fact that the modulation is a scalar multiplier whose absolute value is at most B, preserving the eigenvalue summability and decay rates of the underlying kernel operator. We also state the assumption explicitly and note that it is standard for these losses. revision: yes
Referee: [Main convergence theorems] Theorems on minimax-optimal rates (last iterate, suffix average, and strong RKHS convergence): these rates are asserted to follow from the spectral estimation and the bias-variance choice of truncation dimension. If the estimation is derived from the same data used to tune the adaptive scale, the rates risk becoming fitted quantities rather than independent predictions. Clarify whether the spectral bounds are obtained independently of the tuning data and how they remain valid under the weaker structural assumptions typical of non-quadratic losses.

Authors: The spectral bounds are derived from the population kernel covariance operator and its known eigenvalue decay (e.g., polynomial or exponential), which are data-independent. The adaptive truncation dimension m is selected via a theoretical bias-variance formula that uses only these population upper bounds, not empirical estimates computed from the same training sample that drives the SGD iterates. High-probability deviation between empirical and population spectra is controlled separately by matrix Bernstein inequalities in the proof, ensuring the final rates remain valid predictions rather than fitted quantities. For non-quadratic losses the same bounded-derivative assumption (now stated as Assumption 2.3) suffices to bound the variance term; no additional strong-convexity or Lipschitz-gradient assumptions are required beyond what is already used for the quadratic case. We have added a clarifying paragraph after Theorem 4.3 and a remark in Section 4.1. revision: yes

Circularity Check

0 steps flagged

No significant circularity; spectral estimation is an independent contribution

full rationale

The paper introduces a 'new estimation of the spectral structure of the kernel-induced covariance operator' as the foundation for its analytical framework that unifies optimization and generalization. This estimation is developed within the work and then applied to establish minimax-optimal rates for the last iterate and suffix average, plus optimal strong RKHS convergence. No quoted equations or self-citations reduce the central claims to tautologies or fitted inputs renamed as predictions. The accommodation of general losses (least-squares, Huber, logistic) is asserted under the paper's assumptions without evidence of self-definitional loops or load-bearing prior self-work that lacks external verification. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete and based on stated high-level assumptions rather than explicit equations.

free parameters (1)

adaptive truncation scale
Chosen according to bias-variance trade-off to determine the finite-dimensional hypothesis space size.

axioms (2)

domain assumption The infinite series expansion of spherical radial basis functions permits an accurate finite-dimensional projection of the stochastic gradient.
Invoked to justify the regularization strategy that improves efficiency and generalization.
ad hoc to paper A new estimation of the spectral structure of the kernel-induced covariance operator is sufficiently accurate to unify optimization and generalization bounds.
Central to the analytical framework that delivers the claimed rates.

pith-pipeline@v0.9.0 · 5728 in / 1414 out tokens · 29561 ms · 2026-05-18T09:50:12.370328+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove that both the last iterate and the suffix average converge at minimax-optimal rates... under local μ-strong convexity and L-smoothness.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages

[1]

Toward large kernel models

Amirhesam Abedsoltan, Mikhail Belkin, and Parthe Pandit. Toward large kernel models. InProceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 61–78. PMLR, 23–29 Jul 2023

work page 2023
[2]

Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions.The Annals of Statistics, 47(4):2117–2144, 2019

Pierre Alquier, Vincent Cottet, and Guillaume Lecu´ e. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions.The Annals of Statistics, 47(4):2117–2144, 2019

work page 2019
[3]

Faster kernel ridge regression using sketching and preconditioning.SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017

Haim Avron, Kenneth Clarkson, and David Woodruff. Faster kernel ridge regression using sketching and preconditioning.SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017

work page 2017
[4]

Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InProceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, page 451–459, Red Hook, NY, USA,

work page
[5]

Curran Associates Inc

work page
[6]

Truncated kernel stochastic gradient descent on spheres.Math- ematics of Computation, 2025

Jinhui Bai and Lei Shi. Truncated kernel stochastic gradient descent on spheres.Math- ematics of Computation, 2025. Published online. 22

work page 2025
[7]

A general and adaptive robust loss function

Jonathan Barron. A general and adaptive robust loss function. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

work page 2019
[8]

Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101(473):138–156, 2006

Peter Bartlett, Michael Jordan, and Jon Mcauliffe. Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101(473):138–156, 2006

work page 2006
[9]

Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); Philadelphia, PA: Mathematical Optimization Society (MOS), 2017

Amir Beck.First-order Methods in Optimization. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); Philadelphia, PA: Mathematical Optimization Society (MOS), 2017

work page 2017
[10]

The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields.Computer Vision and Image Understanding, 63(1):75–104, 1996

Michael Black and Padmanabhan Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields.Computer Vision and Image Understanding, 63(1):75–104, 1996

work page 1996
[11]

Optimal rates for regularization of statistical inverse learning problems.Foundations of Computational Mathematics, 18(4):971–1013, 2018

Gilles Blanchard and Nicole M¨ ucke. Optimal rates for regularization of statistical inverse learning problems.Foundations of Computational Mathematics, 18(4):971–1013, 2018

work page 2018
[12]

Vladimir Bogachev.Measure Theory. Vol. I and II. Berlin: Springer, 2007

work page 2007
[13]

Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

L´ eon Bottou, Frank Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

work page 2018
[14]

Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

Andrea Caponnetto and Ernesto Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

work page 2007
[15]

Consistency and robustness of kernel-based regression in convex risk minimization.Bernoulli, 13(3):799–819, 2007

Andreas Christmann and Ingo Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization.Bernoulli, 13(3):799–819, 2007

work page 2007
[16]

Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2013

Philippe Ciarlet.Linear and Nonlinear Functional Analysis with Applications, volume 130 ofOther Titles in Applied Mathematics. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2013

work page 2013
[17]

edition, 2001

Nello Cristianini and John Shawe-Taylor.An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.Cambridge: Cambridge University Press, repr. edition, 2001

work page 2001
[18]

On the mathematical foundations of learning.Bulletin of the American Mathematical Society

Felipe Cucker and Steve Smale. On the mathematical foundations of learning.Bulletin of the American Mathematical Society. New Series, 39(1):1–49, 2002

work page 2002
[19]

Springer Monographs in Mathematics

Feng Dai and Yuan Xu.Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer Monographs in Mathematics. New York, NY: Springer, 2013

work page 2013
[20]

Nonparametric stochastic approximation with large step-sizes.The Annals of Statistics, 44(4):1363–1399, 2016

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.The Annals of Statistics, 44(4):1363–1399, 2016

work page 2016
[21]

Error bounds for the asymptotic expansion of the ratio of two gamma functions with complex argument.SIAM Journal on Mathematical Analysis, 23(2):505– 511, 1992

Chris Frenzen. Error bounds for the asymptotic expansion of the ratio of two gamma functions with complex argument.SIAM Journal on Mathematical Analysis, 23(2):505– 511, 1992

work page 1992
[22]

The analysis of rates using Poisson regression models.Biometrics, 39:665–674, 1983

Edward Frome. The analysis of rates using Poisson regression models.Biometrics, 39:665–674, 1983. 23

work page 1983
[23]

Improved estimates of global ocean circulation, heat transport and mixing from hydrographic data.Nature, 408(6811):453–457, 2000

Alexandre Ganachaud and Carl Wunsch. Improved estimates of global ocean circulation, heat transport and mixing from hydrographic data.Nature, 408(6811):453–457, 2000

work page 2000
[24]

Optimality of robust online learning

Zheng-Chu Guo, Andreas Christmann, and Lei Shi. Optimality of robust online learning. Foundations of Computational Mathematics, 24(5):1455–1483, 2024

work page 2024
[25]

Fast and strong convergence of online learning algorithms

Zheng-Chu Guo and Lei Shi. Fast and strong convergence of online learning algorithms. Advances in Computational Mathematics, 45(5-6):2745–2770, 2019

work page 2019
[26]

John Wiley & Sons, Hoboken, NJ, 1986

Frank Hampel, Elvezio Ronchetti, Peter Rousseeuw, and Werner Stahel.Robust Statis- tics: The Approach Based on Influence Functions. John Wiley & Sons, Hoboken, NJ, 1986

work page 1986
[27]

Radial basis function approximation of noisy scattered data on the sphere.Numerische Mathematik, 137(3):579–605, 2017

Kerstin Hesse, Ian Sloan, and Robert Womersley. Radial basis function approximation of noisy scattered data on the sphere.Numerische Mathematik, 137(3):579–605, 2017

work page 2017
[28]

Robust regression using iteratively reweighted least- squares.Communications in Statistics - Theory and Methods, 6(9):813–827, 1977

Paul Holland and Roy Welsch. Robust regression using iteratively reweighted least- squares.Communications in Statistics - Theory and Methods, 6(9):813–827, 1977

work page 1977
[29]

Cosmic microwave background anisotropies.Annual Review of Astronomy and Astrophysics, 40(1):171–216, 2002

Wayne Hu and Scott Dodelson. Cosmic microwave background anisotropies.Annual Review of Astronomy and Astrophysics, 40(1):171–216, 2002

work page 2002
[30]

Neural tangent kernel: Convergence and generalization in neural networks

Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. InAdvances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018
[31]

Making the last iterate of SGD information theoretically optimal.SIAM Journal on Optimization, 31(2):1108– 1130, 2021

Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of SGD information theoretically optimal.SIAM Journal on Optimization, 31(2):1108– 1130, 2021

work page 2021
[32]

Cham: Springer, 7th edition edition, 2017

J¨ urgen Jost.Riemannian Geometry and Geometric Analysis. Cham: Springer, 7th edition edition, 2017

work page 2017
[33]

Online learning with kernels

Jyrki Kivinen, Alexander Smola, and Robert Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004

work page 2004
[34]

Cham: Springer, 2020

Guanghui Lan.First-order and Stochastic Optimization Methods for Machine Learning. Cham: Springer, 2020

work page 2020
[35]

New York, NY: Springer, 2nd revised ed edition, 2013

John Lee.Introduction to Smooth Manifolds, volume 218. New York, NY: Springer, 2nd revised ed edition, 2013

work page 2013
[36]

Kernel interpolation of high dimen- sional scattered data.SIAM Journal on Numerical Analysis, 62(3):1098–1118, 2024

Shao-Bo Lin, Xiangyu Chang, and Xingping Sun. Kernel interpolation of high dimen- sional scattered data.SIAM Journal on Numerical Analysis, 62(3):1098–1118, 2024

work page 2024
[37]

Be- yond least-squares: fast rates for regularized empirical risk minimization through self- concordance

Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Be- yond least-squares: fast rates for regularized empirical risk minimization through self- concordance. InProceedings of the Thirty-Second Conference on Learning Theory, vol- ume 99 ofProceedings of Machine Learning Research, pages 2294–2340. PMLR, 25–28 Jun 2019

work page 2019
[38]

Nonparametric regression for spher- ical data.Journal of the American Statistical Association, 109(506):748–763, 2014

Marco Marzio, Agnese Panzera, and Charles Taylor. Nonparametric regression for spher- ical data.Journal of the American Statistical Association, 109(506):748–763, 2014. 24

work page 2014
[39]

Nonparametric rotations for sphere- sphere regression.Journal of the American Statistical Association, 114(525):466–476, 2019

Marco Marzio, Agnese Panzera, and Charles Taylor. Nonparametric rotations for sphere- sphere regression.Journal of the American Statistical Association, 114(525):466–476, 2019

work page 2019
[40]

Regularization in kernel learning.The Annals of Statistics, 38(1):526–565, 2010

Shahar Mendelson and Joseph Neeman. Regularization in kernel learning.The Annals of Statistics, 38(1):526–565, 2010

work page 2010
[41]

Theory of reproducing kernels.Transactions of the American Mathematical Society, 68:337–404, 1950

Aronszajn Nachman. Theory of reproducing kernels.Transactions of the American Mathematical Society, 68:337–404, 1950

work page 1950
[42]

A Basic Course., vol- ume 87 ofApplied Optimization

Yurii Nesterov.Introductory Lectures on Convex Optimization. A Basic Course., vol- ume 87 ofApplied Optimization. Boston: Kluwer Academic Publishers, 2004

work page 2004
[43]

Cambridge: Cambridge University Press, 2010

Frank Olver, Daniel Lozier, Ronald Boisvert, and Charles Clark, editors.NIST Handbook of Mathematical Functions. Cambridge: Cambridge University Press, 2010

work page 2010
[44]

Spatially adaptive smoothing splines.Biometrika, 93(1):113–125, 2006

Alexandre Pintore, Paul Speckman, and Chris Holmes. Spatially adaptive smoothing splines.Biometrika, 93(1):113–125, 2006

work page 2006
[45]

Acceleration of stochastic approximation by averag- ing.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averag- ing.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

work page 1992
[46]

Early stopping and non-parametric regression: an optimal data-dependent stopping rule.Journal of Machine Learning Re- search, 15:335–366, 2014

Garvesh Raskutti, Martin Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule.Journal of Machine Learning Re- search, 15:335–366, 2014

work page 2014
[47]

A stochastic approximation method.Annals of Mathematical Statistics, 22:400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.Annals of Mathematical Statistics, 22:400–407, 1951

work page 1951
[48]

Spherical regression models using projective linear transformations.Journal of the American Statistical As- sociation, 109(508):1615–1624, 2014

Michael Rosenthal, Wei Wu, Eric Klassen, and Anuj Srivastava. Spherical regression models using projective linear transformations.Journal of the American Statistical As- sociation, 109(508):1615–1624, 2014

work page 2014
[49]

On fast leverage score sampling and optimal learning

Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. InAdvances in Neural Information Pro- cessing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018
[50]

FALKON: an optimal large scale kernel method

Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: an optimal large scale kernel method. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017
[51]

Sparse Cholesky factorization by Kullback-Leibler minimization.SIAM Journal on Scientific Computing, 43(3):a2019– a2046, 2021

Florian Sch¨ afer, Matthias Katzfuss, and Houman Owhadi. Sparse Cholesky factorization by Kullback-Leibler minimization.SIAM Journal on Scientific Computing, 43(3):a2019– a2046, 2021

work page 2021
[52]

The MIT Press, 12 2001

Bernhard Sch¨ olkopf and Alexander Smola.Learning with Kernels: Support Vector Ma- chines, Regularization, Optimization, and Beyond. The MIT Press, 12 2001

work page 2001
[53]

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InProceedings of the 30th Interna- tional Conference on Machine Learning, volume 28 ofProceedings of Machine Learning Research, pages 71–79, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. 25

work page 2013
[54]

Online learning algorithms.Foundations of Computational Mathematics, 6(2):145–170, 2006

Steve Smale and Yuan Yao. Online learning algorithms.Foundations of Computational Mathematics, 6(2):145–170, 2006

work page 2006
[55]

Learning theory estimates via integral operators and their approximations.Constructive Approximation, 26(2):153–172, 2007

Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations.Constructive Approximation, 26(2):153–172, 2007

work page 2007
[56]

New York, NY: Springer, 2008

Ingo Steinwart and Andreas Christmann.Support Vector Machines. New York, NY: Springer, 2008

work page 2008
[57]

Online learning as stochastic approximation of regulariza- tion paths: optimality and almost-sure convergence.IEEE Transactions on Information Theory, 60(9):5716–5735, 2014

Pierre Tarr` es and Yuan Yao. Online learning as stochastic approximation of regulariza- tion paths: optimality and almost-sure convergence.IEEE Transactions on Information Theory, 60(9):5716–5735, 2014

work page 2014
[58]

The nystr¨ om method for convex loss functions.Journal of Machine Learning Research, 25(360):1–60, 2024

Andrea Vecchia, Ernesto Vito, Jaouad Mourtada, and Lorenzo Rosasco. The nystr¨ om method for convex loss functions.Journal of Machine Learning Research, 25(360):1–60, 2024

work page 2024
[59]

Cam- bridge Series in Statistical and Probabilistic Mathematics

Martin Wainwright.High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam- bridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019

work page 2019
[60]

Marginal longitudinal nonparametric regression: locality and efficiency of spline and kernel methods.Journal of the American Statistical Association, 97(458):482–493, 2002

Alan Welsh, Xihong Lin, and Raymond Carroll. Marginal longitudinal nonparametric regression: locality and efficiency of spline and kernel methods.Journal of the American Statistical Association, 97(458):482–493, 2002

work page 2002
[61]

Benefits of early stopping in gradient descent for overparameterized logistic regression

Jingfeng Wu, Peter Bartlett, Matus Telgarsky, and Bin Yu. Benefits of early stopping in gradient descent for overparameterized logistic regression. InForty-second International Conference on Machine Learning, 2025

work page 2025
[62]

Robust truncated hinge loss support vector machines.Jour- nal of the American Statistical Association, 102(479):974–983, 2007

Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines.Jour- nal of the American Statistical Association, 102(479):974–983, 2007

work page 2007
[63]

Randomized sketches for kernels: fast and optimal nonparametric regression.The Annals of Statistics, 45(3):991–1023, 2017

Yun Yang, Mert Pilanci, and Martin Wainwright. Randomized sketches for kernels: fast and optimal nonparametric regression.The Annals of Statistics, 45(3):991–1023, 2017

work page 2017
[64]

Online gradient descent learning algorithms

Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008

work page 2008
[65]

A sieve stochastic gradient descent estimator for online nonparametric regression in Sobolev ellipsoids.The Annals of Statistics, 50(5):2848–2871, 2022

Tianyu Zhang and Noah Simon. A sieve stochastic gradient descent estimator for online nonparametric regression in Sobolev ellipsoids.The Annals of Statistics, 50(5):2848–2871, 2022

work page 2022
[66]

Statistical behavior and consistency of classification methods based on convex risk minimization.The Annals of Statistics, 32(1):56–85, 2004

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization.The Annals of Statistics, 32(1):56–85, 2004

work page 2004
[67]

Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates.Journal of Machine Learning Research, 16:3299–3340, 2015

Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates.Journal of Machine Learning Research, 16:3299–3340, 2015

work page 2015
[68]

8eC γ0 +γ 0M 2 1 # (n+ 1) − 2r 2r+1 log(n+ 2).(A.41) 49 Combining the definition ofS k with (A.41) yields kSk−1 = (k+ 1)S k −E h E ˆfn−k i =kS k + Sk −E h E ˆfn−k i ≤kS k +

Libin Zhu, Chaoyue Liu, and Mikhail Belkin. Transition to linearity of general neural networks with directed acyclic graph architecture. InAdvances in Neural Information Processing Systems, volume 35, pages 5363–5375. Curran Associates, Inc., 2022. 26 A Appendix A.1 Preliminaries In this section, we present the explicit expressions of spherical harmonics ...

work page 2022

[1] [1]

Toward large kernel models

Amirhesam Abedsoltan, Mikhail Belkin, and Parthe Pandit. Toward large kernel models. InProceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 61–78. PMLR, 23–29 Jul 2023

work page 2023

[2] [2]

Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions.The Annals of Statistics, 47(4):2117–2144, 2019

Pierre Alquier, Vincent Cottet, and Guillaume Lecu´ e. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions.The Annals of Statistics, 47(4):2117–2144, 2019

work page 2019

[3] [3]

Faster kernel ridge regression using sketching and preconditioning.SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017

Haim Avron, Kenneth Clarkson, and David Woodruff. Faster kernel ridge regression using sketching and preconditioning.SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017

work page 2017

[4] [4]

Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InProceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, page 451–459, Red Hook, NY, USA,

work page

[5] [5]

Curran Associates Inc

work page

[6] [6]

Truncated kernel stochastic gradient descent on spheres.Math- ematics of Computation, 2025

Jinhui Bai and Lei Shi. Truncated kernel stochastic gradient descent on spheres.Math- ematics of Computation, 2025. Published online. 22

work page 2025

[7] [7]

A general and adaptive robust loss function

Jonathan Barron. A general and adaptive robust loss function. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

work page 2019

[8] [8]

Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101(473):138–156, 2006

Peter Bartlett, Michael Jordan, and Jon Mcauliffe. Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101(473):138–156, 2006

work page 2006

[9] [9]

Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); Philadelphia, PA: Mathematical Optimization Society (MOS), 2017

Amir Beck.First-order Methods in Optimization. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); Philadelphia, PA: Mathematical Optimization Society (MOS), 2017

work page 2017

[10] [10]

The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields.Computer Vision and Image Understanding, 63(1):75–104, 1996

Michael Black and Padmanabhan Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields.Computer Vision and Image Understanding, 63(1):75–104, 1996

work page 1996

[11] [11]

Optimal rates for regularization of statistical inverse learning problems.Foundations of Computational Mathematics, 18(4):971–1013, 2018

Gilles Blanchard and Nicole M¨ ucke. Optimal rates for regularization of statistical inverse learning problems.Foundations of Computational Mathematics, 18(4):971–1013, 2018

work page 2018

[12] [12]

Vladimir Bogachev.Measure Theory. Vol. I and II. Berlin: Springer, 2007

work page 2007

[13] [13]

Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

L´ eon Bottou, Frank Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

work page 2018

[14] [14]

Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

Andrea Caponnetto and Ernesto Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

work page 2007

[15] [15]

Consistency and robustness of kernel-based regression in convex risk minimization.Bernoulli, 13(3):799–819, 2007

Andreas Christmann and Ingo Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization.Bernoulli, 13(3):799–819, 2007

work page 2007

[16] [16]

Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2013

Philippe Ciarlet.Linear and Nonlinear Functional Analysis with Applications, volume 130 ofOther Titles in Applied Mathematics. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2013

work page 2013

[17] [17]

edition, 2001

Nello Cristianini and John Shawe-Taylor.An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.Cambridge: Cambridge University Press, repr. edition, 2001

work page 2001

[18] [18]

On the mathematical foundations of learning.Bulletin of the American Mathematical Society

Felipe Cucker and Steve Smale. On the mathematical foundations of learning.Bulletin of the American Mathematical Society. New Series, 39(1):1–49, 2002

work page 2002

[19] [19]

Springer Monographs in Mathematics

Feng Dai and Yuan Xu.Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer Monographs in Mathematics. New York, NY: Springer, 2013

work page 2013

[20] [20]

Nonparametric stochastic approximation with large step-sizes.The Annals of Statistics, 44(4):1363–1399, 2016

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.The Annals of Statistics, 44(4):1363–1399, 2016

work page 2016

[21] [21]

Error bounds for the asymptotic expansion of the ratio of two gamma functions with complex argument.SIAM Journal on Mathematical Analysis, 23(2):505– 511, 1992

Chris Frenzen. Error bounds for the asymptotic expansion of the ratio of two gamma functions with complex argument.SIAM Journal on Mathematical Analysis, 23(2):505– 511, 1992

work page 1992

[22] [22]

The analysis of rates using Poisson regression models.Biometrics, 39:665–674, 1983

Edward Frome. The analysis of rates using Poisson regression models.Biometrics, 39:665–674, 1983. 23

work page 1983

[23] [23]

Improved estimates of global ocean circulation, heat transport and mixing from hydrographic data.Nature, 408(6811):453–457, 2000

Alexandre Ganachaud and Carl Wunsch. Improved estimates of global ocean circulation, heat transport and mixing from hydrographic data.Nature, 408(6811):453–457, 2000

work page 2000

[24] [24]

Optimality of robust online learning

Zheng-Chu Guo, Andreas Christmann, and Lei Shi. Optimality of robust online learning. Foundations of Computational Mathematics, 24(5):1455–1483, 2024

work page 2024

[25] [25]

Fast and strong convergence of online learning algorithms

Zheng-Chu Guo and Lei Shi. Fast and strong convergence of online learning algorithms. Advances in Computational Mathematics, 45(5-6):2745–2770, 2019

work page 2019

[26] [26]

John Wiley & Sons, Hoboken, NJ, 1986

Frank Hampel, Elvezio Ronchetti, Peter Rousseeuw, and Werner Stahel.Robust Statis- tics: The Approach Based on Influence Functions. John Wiley & Sons, Hoboken, NJ, 1986

work page 1986

[27] [27]

Radial basis function approximation of noisy scattered data on the sphere.Numerische Mathematik, 137(3):579–605, 2017

Kerstin Hesse, Ian Sloan, and Robert Womersley. Radial basis function approximation of noisy scattered data on the sphere.Numerische Mathematik, 137(3):579–605, 2017

work page 2017

[28] [28]

Robust regression using iteratively reweighted least- squares.Communications in Statistics - Theory and Methods, 6(9):813–827, 1977

Paul Holland and Roy Welsch. Robust regression using iteratively reweighted least- squares.Communications in Statistics - Theory and Methods, 6(9):813–827, 1977

work page 1977

[29] [29]

Cosmic microwave background anisotropies.Annual Review of Astronomy and Astrophysics, 40(1):171–216, 2002

Wayne Hu and Scott Dodelson. Cosmic microwave background anisotropies.Annual Review of Astronomy and Astrophysics, 40(1):171–216, 2002

work page 2002

[30] [30]

Neural tangent kernel: Convergence and generalization in neural networks

Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. InAdvances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018

[31] [31]

Making the last iterate of SGD information theoretically optimal.SIAM Journal on Optimization, 31(2):1108– 1130, 2021

Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of SGD information theoretically optimal.SIAM Journal on Optimization, 31(2):1108– 1130, 2021

work page 2021

[32] [32]

Cham: Springer, 7th edition edition, 2017

J¨ urgen Jost.Riemannian Geometry and Geometric Analysis. Cham: Springer, 7th edition edition, 2017

work page 2017

[33] [33]

Online learning with kernels

Jyrki Kivinen, Alexander Smola, and Robert Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004

work page 2004

[34] [34]

Cham: Springer, 2020

Guanghui Lan.First-order and Stochastic Optimization Methods for Machine Learning. Cham: Springer, 2020

work page 2020

[35] [35]

New York, NY: Springer, 2nd revised ed edition, 2013

John Lee.Introduction to Smooth Manifolds, volume 218. New York, NY: Springer, 2nd revised ed edition, 2013

work page 2013

[36] [36]

Kernel interpolation of high dimen- sional scattered data.SIAM Journal on Numerical Analysis, 62(3):1098–1118, 2024

Shao-Bo Lin, Xiangyu Chang, and Xingping Sun. Kernel interpolation of high dimen- sional scattered data.SIAM Journal on Numerical Analysis, 62(3):1098–1118, 2024

work page 2024

[37] [37]

Be- yond least-squares: fast rates for regularized empirical risk minimization through self- concordance

Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Be- yond least-squares: fast rates for regularized empirical risk minimization through self- concordance. InProceedings of the Thirty-Second Conference on Learning Theory, vol- ume 99 ofProceedings of Machine Learning Research, pages 2294–2340. PMLR, 25–28 Jun 2019

work page 2019

[38] [38]

Nonparametric regression for spher- ical data.Journal of the American Statistical Association, 109(506):748–763, 2014

Marco Marzio, Agnese Panzera, and Charles Taylor. Nonparametric regression for spher- ical data.Journal of the American Statistical Association, 109(506):748–763, 2014. 24

work page 2014

[39] [39]

Nonparametric rotations for sphere- sphere regression.Journal of the American Statistical Association, 114(525):466–476, 2019

Marco Marzio, Agnese Panzera, and Charles Taylor. Nonparametric rotations for sphere- sphere regression.Journal of the American Statistical Association, 114(525):466–476, 2019

work page 2019

[40] [40]

Regularization in kernel learning.The Annals of Statistics, 38(1):526–565, 2010

Shahar Mendelson and Joseph Neeman. Regularization in kernel learning.The Annals of Statistics, 38(1):526–565, 2010

work page 2010

[41] [41]

Theory of reproducing kernels.Transactions of the American Mathematical Society, 68:337–404, 1950

Aronszajn Nachman. Theory of reproducing kernels.Transactions of the American Mathematical Society, 68:337–404, 1950

work page 1950

[42] [42]

A Basic Course., vol- ume 87 ofApplied Optimization

Yurii Nesterov.Introductory Lectures on Convex Optimization. A Basic Course., vol- ume 87 ofApplied Optimization. Boston: Kluwer Academic Publishers, 2004

work page 2004

[43] [43]

Cambridge: Cambridge University Press, 2010

Frank Olver, Daniel Lozier, Ronald Boisvert, and Charles Clark, editors.NIST Handbook of Mathematical Functions. Cambridge: Cambridge University Press, 2010

work page 2010

[44] [44]

Spatially adaptive smoothing splines.Biometrika, 93(1):113–125, 2006

Alexandre Pintore, Paul Speckman, and Chris Holmes. Spatially adaptive smoothing splines.Biometrika, 93(1):113–125, 2006

work page 2006

[45] [45]

Acceleration of stochastic approximation by averag- ing.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averag- ing.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

work page 1992

[46] [46]

Early stopping and non-parametric regression: an optimal data-dependent stopping rule.Journal of Machine Learning Re- search, 15:335–366, 2014

Garvesh Raskutti, Martin Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule.Journal of Machine Learning Re- search, 15:335–366, 2014

work page 2014

[47] [47]

A stochastic approximation method.Annals of Mathematical Statistics, 22:400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.Annals of Mathematical Statistics, 22:400–407, 1951

work page 1951

[48] [48]

Spherical regression models using projective linear transformations.Journal of the American Statistical As- sociation, 109(508):1615–1624, 2014

Michael Rosenthal, Wei Wu, Eric Klassen, and Anuj Srivastava. Spherical regression models using projective linear transformations.Journal of the American Statistical As- sociation, 109(508):1615–1624, 2014

work page 2014

[49] [49]

On fast leverage score sampling and optimal learning

Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. InAdvances in Neural Information Pro- cessing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018

[50] [50]

FALKON: an optimal large scale kernel method

Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: an optimal large scale kernel method. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017

[51] [51]

Sparse Cholesky factorization by Kullback-Leibler minimization.SIAM Journal on Scientific Computing, 43(3):a2019– a2046, 2021

Florian Sch¨ afer, Matthias Katzfuss, and Houman Owhadi. Sparse Cholesky factorization by Kullback-Leibler minimization.SIAM Journal on Scientific Computing, 43(3):a2019– a2046, 2021

work page 2021

[52] [52]

The MIT Press, 12 2001

Bernhard Sch¨ olkopf and Alexander Smola.Learning with Kernels: Support Vector Ma- chines, Regularization, Optimization, and Beyond. The MIT Press, 12 2001

work page 2001

[53] [53]

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InProceedings of the 30th Interna- tional Conference on Machine Learning, volume 28 ofProceedings of Machine Learning Research, pages 71–79, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. 25

work page 2013

[54] [54]

Online learning algorithms.Foundations of Computational Mathematics, 6(2):145–170, 2006

Steve Smale and Yuan Yao. Online learning algorithms.Foundations of Computational Mathematics, 6(2):145–170, 2006

work page 2006

[55] [55]

Learning theory estimates via integral operators and their approximations.Constructive Approximation, 26(2):153–172, 2007

Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations.Constructive Approximation, 26(2):153–172, 2007

work page 2007

[56] [56]

New York, NY: Springer, 2008

Ingo Steinwart and Andreas Christmann.Support Vector Machines. New York, NY: Springer, 2008

work page 2008

[57] [57]

Online learning as stochastic approximation of regulariza- tion paths: optimality and almost-sure convergence.IEEE Transactions on Information Theory, 60(9):5716–5735, 2014

Pierre Tarr` es and Yuan Yao. Online learning as stochastic approximation of regulariza- tion paths: optimality and almost-sure convergence.IEEE Transactions on Information Theory, 60(9):5716–5735, 2014

work page 2014

[58] [58]

The nystr¨ om method for convex loss functions.Journal of Machine Learning Research, 25(360):1–60, 2024

Andrea Vecchia, Ernesto Vito, Jaouad Mourtada, and Lorenzo Rosasco. The nystr¨ om method for convex loss functions.Journal of Machine Learning Research, 25(360):1–60, 2024

work page 2024

[59] [59]

Cam- bridge Series in Statistical and Probabilistic Mathematics

Martin Wainwright.High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam- bridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019

work page 2019

[60] [60]

Marginal longitudinal nonparametric regression: locality and efficiency of spline and kernel methods.Journal of the American Statistical Association, 97(458):482–493, 2002

Alan Welsh, Xihong Lin, and Raymond Carroll. Marginal longitudinal nonparametric regression: locality and efficiency of spline and kernel methods.Journal of the American Statistical Association, 97(458):482–493, 2002

work page 2002

[61] [61]

Benefits of early stopping in gradient descent for overparameterized logistic regression

Jingfeng Wu, Peter Bartlett, Matus Telgarsky, and Bin Yu. Benefits of early stopping in gradient descent for overparameterized logistic regression. InForty-second International Conference on Machine Learning, 2025

work page 2025

[62] [62]

Robust truncated hinge loss support vector machines.Jour- nal of the American Statistical Association, 102(479):974–983, 2007

Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines.Jour- nal of the American Statistical Association, 102(479):974–983, 2007

work page 2007

[63] [63]

Randomized sketches for kernels: fast and optimal nonparametric regression.The Annals of Statistics, 45(3):991–1023, 2017

Yun Yang, Mert Pilanci, and Martin Wainwright. Randomized sketches for kernels: fast and optimal nonparametric regression.The Annals of Statistics, 45(3):991–1023, 2017

work page 2017

[64] [64]

Online gradient descent learning algorithms

Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008

work page 2008

[65] [65]

A sieve stochastic gradient descent estimator for online nonparametric regression in Sobolev ellipsoids.The Annals of Statistics, 50(5):2848–2871, 2022

Tianyu Zhang and Noah Simon. A sieve stochastic gradient descent estimator for online nonparametric regression in Sobolev ellipsoids.The Annals of Statistics, 50(5):2848–2871, 2022

work page 2022

[66] [66]

Statistical behavior and consistency of classification methods based on convex risk minimization.The Annals of Statistics, 32(1):56–85, 2004

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization.The Annals of Statistics, 32(1):56–85, 2004

work page 2004

[67] [67]

Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates.Journal of Machine Learning Research, 16:3299–3340, 2015

Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates.Journal of Machine Learning Research, 16:3299–3340, 2015

work page 2015

[68] [68]

8eC γ0 +γ 0M 2 1 # (n+ 1) − 2r 2r+1 log(n+ 2).(A.41) 49 Combining the definition ofS k with (A.41) yields kSk−1 = (k+ 1)S k −E h E ˆfn−k i =kS k + Sk −E h E ˆfn−k i ≤kS k +

Libin Zhu, Chaoyue Liu, and Mikhail Belkin. Transition to linearity of general neural networks with directed acyclic graph architecture. InAdvances in Neural Information Processing Systems, volume 35, pages 5363–5375. Curran Associates, Inc., 2022. 26 A Appendix A.1 Preliminaries In this section, we present the explicit expressions of spherical harmonics ...

work page 2022