Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions
Pith reviewed 2026-05-18 09:50 UTC · model grok-4.3
The pith
Truncated kernel SGD projects stochastic gradients onto adaptively scaled finite-dimensional spaces using spherical radial basis functions and achieves minimax-optimal convergence rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By projecting the stochastic gradient onto a finite-dimensional space via the series expansion of spherical radial basis functions and using a refined estimate of the eigenvalue decay of the covariance operator, the algorithm attains minimax optimal rates for both the last iterate and the averaged iterates, as well as optimal rates in the RKHS norm, for a wide range of convex loss functions.
What carries the argument
Adaptive truncation of the kernel gradient using the infinite series of spherical radial basis functions, driven by a new spectral estimate of the covariance operator to balance bias and variance.
If this is right
- The last iterate of the algorithm converges at the minimax-optimal rate.
- The suffix average also achieves the optimal rate.
- Optimal strong convergence holds in the reproducing kernel Hilbert space.
- The method works for least-squares, Huber, and logistic losses.
- Computational and storage complexity are reduced to linear in the sample size, supporting streaming data.
Where Pith is reading between the lines
- If the spectral estimation holds for other kernel families, similar truncation strategies could extend to non-radial kernels.
- The coordinate-wise update structure suggests natural parallels to coordinate descent methods in high-dimensional statistics.
- Testing the adaptive scaling rule on datasets with varying noise levels could confirm the bias-variance justification in practice.
- Strong RKHS convergence may imply faster rates for downstream tasks like model selection or uncertainty quantification.
Load-bearing premise
The estimate of the spectral decay of the kernel-induced covariance operator must be sufficiently precise to correctly guide the adaptive choice of projection dimension and to combine the optimization and generalization error bounds.
What would settle it
A numerical counterexample or real dataset where the observed convergence rate of the last iterate falls short of the claimed minimax rate, or where increasing the projection dimension beyond the bias-variance optimum degrades performance contrary to the analysis.
Figures
read the original abstract
In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a truncated kernel SGD algorithm that projects stochastic gradients onto a finite-dimensional space via spherical radial basis function expansions, with the truncation dimension chosen adaptively according to a bias-variance tradeoff. A new estimation of the spectral structure of the kernel-induced covariance operator is introduced to unify optimization and generalization analyses. The authors claim to prove minimax-optimal rates for both the last iterate and suffix average, plus optimal strong convergence in the RKHS, for general losses including least-squares, Huber, and logistic. The approach also achieves linear-time coordinate-wise updates and optimal storage for streaming data, with supporting numerical experiments.
Significance. If the central claims are established, the work would offer a scalable kernel method with rigorous guarantees that bridge optimization and statistical performance across non-quadratic losses. The attempt to derive adaptive truncation from a unified spectral analysis is a positive direction for large-scale kernel learning.
major comments (2)
- [Abstract and spectral estimation framework] Abstract and the section introducing the new spectral estimation: the unification of optimization and generalization bounds, as well as the justification for adaptive scaling of the finite-dimensional projection, rests on a new estimation of the spectral structure of the covariance operator. For general losses the stochastic gradient is modulated by the loss derivative evaluated at the current prediction; it is not immediate that the same spectral bounds hold without additional assumptions (e.g., uniform boundedness or Lipschitz conditions on the derivative that interact with kernel eigenvalues). Please provide the precise lemma establishing this extension and state any extra conditions required.
- [Main convergence theorems] Theorems on minimax-optimal rates (last iterate, suffix average, and strong RKHS convergence): these rates are asserted to follow from the spectral estimation and the bias-variance choice of truncation dimension. If the estimation is derived from the same data used to tune the adaptive scale, the rates risk becoming fitted quantities rather than independent predictions. Clarify whether the spectral bounds are obtained independently of the tuning data and how they remain valid under the weaker structural assumptions typical of non-quadratic losses.
minor comments (2)
- [Algorithm and complexity analysis] The description of coordinate-wise updates and streaming-data handling is clear, but a short complexity table comparing wall-clock time and memory against standard kernel SGD would improve readability.
- [Notation and preliminaries] Notation for the adaptive truncation scale and the finite-dimensional projection dimension should be introduced once and used consistently in all statements of the rates.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below, indicating where revisions will be made to improve clarity and rigor.
read point-by-point responses
-
Referee: [Abstract and spectral estimation framework] Abstract and the section introducing the new spectral estimation: the unification of optimization and generalization bounds, as well as the justification for adaptive scaling of the finite-dimensional projection, rests on a new estimation of the spectral structure of the covariance operator. For general losses the stochastic gradient is modulated by the loss derivative evaluated at the current prediction; it is not immediate that the same spectral bounds hold without additional assumptions (e.g., uniform boundedness or Lipschitz conditions on the derivative that interact with kernel eigenvalues). Please provide the precise lemma establishing this extension and state any extra conditions required.
Authors: We thank the referee for this observation. The original manuscript implicitly relied on bounded loss derivatives for the losses considered but did not isolate the extension in a dedicated lemma. In the revision we add Lemma 3.2, which shows that if the loss derivative is uniformly bounded by a constant B (Assumption 2.3), then the operator-norm and trace bounds on the modulated covariance operator are identical to the quadratic case up to the factor B. This covers logistic loss (B=1/4), Huber loss (B equal to the threshold), and least-squares (B=1). The proof uses the fact that the modulation is a scalar multiplier whose absolute value is at most B, preserving the eigenvalue summability and decay rates of the underlying kernel operator. We also state the assumption explicitly and note that it is standard for these losses. revision: yes
-
Referee: [Main convergence theorems] Theorems on minimax-optimal rates (last iterate, suffix average, and strong RKHS convergence): these rates are asserted to follow from the spectral estimation and the bias-variance choice of truncation dimension. If the estimation is derived from the same data used to tune the adaptive scale, the rates risk becoming fitted quantities rather than independent predictions. Clarify whether the spectral bounds are obtained independently of the tuning data and how they remain valid under the weaker structural assumptions typical of non-quadratic losses.
Authors: The spectral bounds are derived from the population kernel covariance operator and its known eigenvalue decay (e.g., polynomial or exponential), which are data-independent. The adaptive truncation dimension m is selected via a theoretical bias-variance formula that uses only these population upper bounds, not empirical estimates computed from the same training sample that drives the SGD iterates. High-probability deviation between empirical and population spectra is controlled separately by matrix Bernstein inequalities in the proof, ensuring the final rates remain valid predictions rather than fitted quantities. For non-quadratic losses the same bounded-derivative assumption (now stated as Assumption 2.3) suffices to bound the variance term; no additional strong-convexity or Lipschitz-gradient assumptions are required beyond what is already used for the quadratic case. We have added a clarifying paragraph after Theorem 4.3 and a remark in Section 4.1. revision: yes
Circularity Check
No significant circularity; spectral estimation is an independent contribution
full rationale
The paper introduces a 'new estimation of the spectral structure of the kernel-induced covariance operator' as the foundation for its analytical framework that unifies optimization and generalization. This estimation is developed within the work and then applied to establish minimax-optimal rates for the last iterate and suffix average, plus optimal strong RKHS convergence. No quoted equations or self-citations reduce the central claims to tautologies or fitted inputs renamed as predictions. The accommodation of general losses (least-squares, Huber, logistic) is asserted under the paper's assumptions without evidence of self-definitional loops or load-bearing prior self-work that lacks external verification. The derivation chain is therefore self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- adaptive truncation scale
axioms (2)
- domain assumption The infinite series expansion of spherical radial basis functions permits an accurate finite-dimensional projection of the stochastic gradient.
- ad hoc to paper A new estimation of the spectral structure of the kernel-induced covariance operator is sufficiently accurate to unify optimization and generalization bounds.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We prove that both the last iterate and the suffix average converge at minimax-optimal rates... under local μ-strong convexity and L-smoothness.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Amirhesam Abedsoltan, Mikhail Belkin, and Parthe Pandit. Toward large kernel models. InProceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 61–78. PMLR, 23–29 Jul 2023
work page 2023
-
[2]
Pierre Alquier, Vincent Cottet, and Guillaume Lecu´ e. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions.The Annals of Statistics, 47(4):2117–2144, 2019
work page 2019
-
[3]
Haim Avron, Kenneth Clarkson, and David Woodruff. Faster kernel ridge regression using sketching and preconditioning.SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017
work page 2017
-
[4]
Non-asymptotic analysis of stochastic approximation algorithms for machine learning
Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InProceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, page 451–459, Red Hook, NY, USA,
-
[5]
Curran Associates Inc
-
[6]
Truncated kernel stochastic gradient descent on spheres.Math- ematics of Computation, 2025
Jinhui Bai and Lei Shi. Truncated kernel stochastic gradient descent on spheres.Math- ematics of Computation, 2025. Published online. 22
work page 2025
-
[7]
A general and adaptive robust loss function
Jonathan Barron. A general and adaptive robust loss function. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
work page 2019
-
[8]
Peter Bartlett, Michael Jordan, and Jon Mcauliffe. Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101(473):138–156, 2006
work page 2006
-
[9]
Amir Beck.First-order Methods in Optimization. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); Philadelphia, PA: Mathematical Optimization Society (MOS), 2017
work page 2017
-
[10]
Michael Black and Padmanabhan Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields.Computer Vision and Image Understanding, 63(1):75–104, 1996
work page 1996
-
[11]
Gilles Blanchard and Nicole M¨ ucke. Optimal rates for regularization of statistical inverse learning problems.Foundations of Computational Mathematics, 18(4):971–1013, 2018
work page 2018
-
[12]
Vladimir Bogachev.Measure Theory. Vol. I and II. Berlin: Springer, 2007
work page 2007
-
[13]
Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018
L´ eon Bottou, Frank Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018
work page 2018
-
[14]
Andrea Caponnetto and Ernesto Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007
work page 2007
-
[15]
Andreas Christmann and Ingo Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization.Bernoulli, 13(3):799–819, 2007
work page 2007
-
[16]
Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2013
Philippe Ciarlet.Linear and Nonlinear Functional Analysis with Applications, volume 130 ofOther Titles in Applied Mathematics. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2013
work page 2013
-
[17]
Nello Cristianini and John Shawe-Taylor.An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.Cambridge: Cambridge University Press, repr. edition, 2001
work page 2001
-
[18]
On the mathematical foundations of learning.Bulletin of the American Mathematical Society
Felipe Cucker and Steve Smale. On the mathematical foundations of learning.Bulletin of the American Mathematical Society. New Series, 39(1):1–49, 2002
work page 2002
-
[19]
Springer Monographs in Mathematics
Feng Dai and Yuan Xu.Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer Monographs in Mathematics. New York, NY: Springer, 2013
work page 2013
-
[20]
Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.The Annals of Statistics, 44(4):1363–1399, 2016
work page 2016
-
[21]
Chris Frenzen. Error bounds for the asymptotic expansion of the ratio of two gamma functions with complex argument.SIAM Journal on Mathematical Analysis, 23(2):505– 511, 1992
work page 1992
-
[22]
The analysis of rates using Poisson regression models.Biometrics, 39:665–674, 1983
Edward Frome. The analysis of rates using Poisson regression models.Biometrics, 39:665–674, 1983. 23
work page 1983
-
[23]
Alexandre Ganachaud and Carl Wunsch. Improved estimates of global ocean circulation, heat transport and mixing from hydrographic data.Nature, 408(6811):453–457, 2000
work page 2000
-
[24]
Optimality of robust online learning
Zheng-Chu Guo, Andreas Christmann, and Lei Shi. Optimality of robust online learning. Foundations of Computational Mathematics, 24(5):1455–1483, 2024
work page 2024
-
[25]
Fast and strong convergence of online learning algorithms
Zheng-Chu Guo and Lei Shi. Fast and strong convergence of online learning algorithms. Advances in Computational Mathematics, 45(5-6):2745–2770, 2019
work page 2019
-
[26]
John Wiley & Sons, Hoboken, NJ, 1986
Frank Hampel, Elvezio Ronchetti, Peter Rousseeuw, and Werner Stahel.Robust Statis- tics: The Approach Based on Influence Functions. John Wiley & Sons, Hoboken, NJ, 1986
work page 1986
-
[27]
Kerstin Hesse, Ian Sloan, and Robert Womersley. Radial basis function approximation of noisy scattered data on the sphere.Numerische Mathematik, 137(3):579–605, 2017
work page 2017
-
[28]
Paul Holland and Roy Welsch. Robust regression using iteratively reweighted least- squares.Communications in Statistics - Theory and Methods, 6(9):813–827, 1977
work page 1977
-
[29]
Wayne Hu and Scott Dodelson. Cosmic microwave background anisotropies.Annual Review of Astronomy and Astrophysics, 40(1):171–216, 2002
work page 2002
-
[30]
Neural tangent kernel: Convergence and generalization in neural networks
Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. InAdvances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018
work page 2018
-
[31]
Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of SGD information theoretically optimal.SIAM Journal on Optimization, 31(2):1108– 1130, 2021
work page 2021
-
[32]
Cham: Springer, 7th edition edition, 2017
J¨ urgen Jost.Riemannian Geometry and Geometric Analysis. Cham: Springer, 7th edition edition, 2017
work page 2017
-
[33]
Jyrki Kivinen, Alexander Smola, and Robert Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004
work page 2004
-
[34]
Guanghui Lan.First-order and Stochastic Optimization Methods for Machine Learning. Cham: Springer, 2020
work page 2020
-
[35]
New York, NY: Springer, 2nd revised ed edition, 2013
John Lee.Introduction to Smooth Manifolds, volume 218. New York, NY: Springer, 2nd revised ed edition, 2013
work page 2013
-
[36]
Shao-Bo Lin, Xiangyu Chang, and Xingping Sun. Kernel interpolation of high dimen- sional scattered data.SIAM Journal on Numerical Analysis, 62(3):1098–1118, 2024
work page 2024
-
[37]
Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Be- yond least-squares: fast rates for regularized empirical risk minimization through self- concordance. InProceedings of the Thirty-Second Conference on Learning Theory, vol- ume 99 ofProceedings of Machine Learning Research, pages 2294–2340. PMLR, 25–28 Jun 2019
work page 2019
-
[38]
Marco Marzio, Agnese Panzera, and Charles Taylor. Nonparametric regression for spher- ical data.Journal of the American Statistical Association, 109(506):748–763, 2014. 24
work page 2014
-
[39]
Marco Marzio, Agnese Panzera, and Charles Taylor. Nonparametric rotations for sphere- sphere regression.Journal of the American Statistical Association, 114(525):466–476, 2019
work page 2019
-
[40]
Regularization in kernel learning.The Annals of Statistics, 38(1):526–565, 2010
Shahar Mendelson and Joseph Neeman. Regularization in kernel learning.The Annals of Statistics, 38(1):526–565, 2010
work page 2010
-
[41]
Theory of reproducing kernels.Transactions of the American Mathematical Society, 68:337–404, 1950
Aronszajn Nachman. Theory of reproducing kernels.Transactions of the American Mathematical Society, 68:337–404, 1950
work page 1950
-
[42]
A Basic Course., vol- ume 87 ofApplied Optimization
Yurii Nesterov.Introductory Lectures on Convex Optimization. A Basic Course., vol- ume 87 ofApplied Optimization. Boston: Kluwer Academic Publishers, 2004
work page 2004
-
[43]
Cambridge: Cambridge University Press, 2010
Frank Olver, Daniel Lozier, Ronald Boisvert, and Charles Clark, editors.NIST Handbook of Mathematical Functions. Cambridge: Cambridge University Press, 2010
work page 2010
-
[44]
Spatially adaptive smoothing splines.Biometrika, 93(1):113–125, 2006
Alexandre Pintore, Paul Speckman, and Chris Holmes. Spatially adaptive smoothing splines.Biometrika, 93(1):113–125, 2006
work page 2006
-
[45]
Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averag- ing.SIAM Journal on Control and Optimization, 30(4):838–855, 1992
work page 1992
-
[46]
Garvesh Raskutti, Martin Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule.Journal of Machine Learning Re- search, 15:335–366, 2014
work page 2014
-
[47]
A stochastic approximation method.Annals of Mathematical Statistics, 22:400–407, 1951
Herbert Robbins and Sutton Monro. A stochastic approximation method.Annals of Mathematical Statistics, 22:400–407, 1951
work page 1951
-
[48]
Michael Rosenthal, Wei Wu, Eric Klassen, and Anuj Srivastava. Spherical regression models using projective linear transformations.Journal of the American Statistical As- sociation, 109(508):1615–1624, 2014
work page 2014
-
[49]
On fast leverage score sampling and optimal learning
Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. InAdvances in Neural Information Pro- cessing Systems, volume 31. Curran Associates, Inc., 2018
work page 2018
-
[50]
FALKON: an optimal large scale kernel method
Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: an optimal large scale kernel method. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
work page 2017
-
[51]
Florian Sch¨ afer, Matthias Katzfuss, and Houman Owhadi. Sparse Cholesky factorization by Kullback-Leibler minimization.SIAM Journal on Scientific Computing, 43(3):a2019– a2046, 2021
work page 2021
-
[52]
Bernhard Sch¨ olkopf and Alexander Smola.Learning with Kernels: Support Vector Ma- chines, Regularization, Optimization, and Beyond. The MIT Press, 12 2001
work page 2001
-
[53]
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InProceedings of the 30th Interna- tional Conference on Machine Learning, volume 28 ofProceedings of Machine Learning Research, pages 71–79, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. 25
work page 2013
-
[54]
Online learning algorithms.Foundations of Computational Mathematics, 6(2):145–170, 2006
Steve Smale and Yuan Yao. Online learning algorithms.Foundations of Computational Mathematics, 6(2):145–170, 2006
work page 2006
-
[55]
Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations.Constructive Approximation, 26(2):153–172, 2007
work page 2007
-
[56]
Ingo Steinwart and Andreas Christmann.Support Vector Machines. New York, NY: Springer, 2008
work page 2008
-
[57]
Pierre Tarr` es and Yuan Yao. Online learning as stochastic approximation of regulariza- tion paths: optimality and almost-sure convergence.IEEE Transactions on Information Theory, 60(9):5716–5735, 2014
work page 2014
-
[58]
Andrea Vecchia, Ernesto Vito, Jaouad Mourtada, and Lorenzo Rosasco. The nystr¨ om method for convex loss functions.Journal of Machine Learning Research, 25(360):1–60, 2024
work page 2024
-
[59]
Cam- bridge Series in Statistical and Probabilistic Mathematics
Martin Wainwright.High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam- bridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019
work page 2019
-
[60]
Alan Welsh, Xihong Lin, and Raymond Carroll. Marginal longitudinal nonparametric regression: locality and efficiency of spline and kernel methods.Journal of the American Statistical Association, 97(458):482–493, 2002
work page 2002
-
[61]
Benefits of early stopping in gradient descent for overparameterized logistic regression
Jingfeng Wu, Peter Bartlett, Matus Telgarsky, and Bin Yu. Benefits of early stopping in gradient descent for overparameterized logistic regression. InForty-second International Conference on Machine Learning, 2025
work page 2025
-
[62]
Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines.Jour- nal of the American Statistical Association, 102(479):974–983, 2007
work page 2007
-
[63]
Yun Yang, Mert Pilanci, and Martin Wainwright. Randomized sketches for kernels: fast and optimal nonparametric regression.The Annals of Statistics, 45(3):991–1023, 2017
work page 2017
-
[64]
Online gradient descent learning algorithms
Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008
work page 2008
-
[65]
Tianyu Zhang and Noah Simon. A sieve stochastic gradient descent estimator for online nonparametric regression in Sobolev ellipsoids.The Annals of Statistics, 50(5):2848–2871, 2022
work page 2022
-
[66]
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization.The Annals of Statistics, 32(1):56–85, 2004
work page 2004
-
[67]
Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates.Journal of Machine Learning Research, 16:3299–3340, 2015
work page 2015
-
[68]
Libin Zhu, Chaoyue Liu, and Mikhail Belkin. Transition to linearity of general neural networks with directed acyclic graph architecture. InAdvances in Neural Information Processing Systems, volume 35, pages 5363–5375. Curran Associates, Inc., 2022. 26 A Appendix A.1 Preliminaries In this section, we present the explicit expressions of spherical harmonics ...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.