arxiv: 2504.18184 · v4 · submitted 2025-04-25 · 📊 stat.ML · cs.LG · math.FA · math.ST · stat.TH

Learning Operators by Regularized Stochastic Gradient Descent with Operator-valued Kernels

Pith reviewed 2026-05-22 18:18 UTC · model grok-4.3

keywords: operator learning, regularized SGD, operator-valued kernels, vector-valued RKHS, dimension-independent bounds, convergence rates, statistical inverse problems, structured prediction

The pith

Regularized SGD with operator-valued kernels delivers dimension-independent bounds for learning regression operators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the problem of estimating an unknown operator that maps from a Polish space into a separable Hilbert space, where the operator belongs to a vector-valued reproducing kernel Hilbert space generated by an operator-valued kernel. It examines regularized stochastic gradient descent in two regimes: an online version with polynomially decaying step sizes and regularization, and a finite-horizon version with fixed parameters. Under structural and distributional assumptions, the analysis yields error bounds for both prediction and estimation that do not grow with the dimension of the output space. These bounds are shown to be near-optimal in expectation, while high-probability versions imply almost sure convergence of the iterates.
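The online regime described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a separable operator-valued kernel K(x, x') = k(x, x')·I with a Gaussian scalar kernel k, a finite grid standing in for the Hilbert-space output, and hypothetical schedule constants (`eta0`, `lam0`, `theta`, `gamma`) chosen only for illustration.

```python
import numpy as np

def gaussian_kernel(X, x, bandwidth=0.2):
    """Scalar Gaussian kernel k(X_i, x) for every row X_i of X."""
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * bandwidth ** 2))

def online_regularized_sgd(X, Y, eta0=0.5, lam0=0.1, theta=0.5, gamma=0.25):
    """Single pass over (X, Y); returns C with f = sum_i k(., X_i) * C_i.

    Assumes the separable kernel K(x, x') = k(x, x') * Identity and the
    hypothetical schedules eta_t = eta0 * t**(-theta), lam_t = lam0 * t**(-gamma).
    """
    n, d = Y.shape
    C = np.zeros((n, d))
    for t in range(n):
        eta_t = eta0 * (t + 1) ** (-theta)
        lam_t = lam0 * (t + 1) ** (-gamma)
        residual = gaussian_kernel(X[:t], X[t]) @ C[:t] - Y[t]  # f_t(x_t) - y_t
        C[:t] *= 1.0 - eta_t * lam_t   # regularization shrinks the current iterate
        C[t] = -eta_t * residual       # gradient step adds a kernel section at x_t
    return C

def predict(X_train, C, X_new):
    """Evaluate the learned map at each row of X_new."""
    return np.stack([gaussian_kernel(X_train, x) @ C for x in X_new])

# Toy problem: map a scalar input x to the function s -> sin(2*pi*(x + s)),
# discretized on d grid points (a finite stand-in for a Hilbert-space output).
rng = np.random.default_rng(0)
n, d = 500, 20
grid = np.arange(d) / d
X = rng.uniform(size=(n, 1))
Y = np.sin(2 * np.pi * (X + grid)) + 0.1 * rng.standard_normal((n, d))
C = online_regularized_sgd(X, Y)

X_test = rng.uniform(size=(200, 1))
Y_test = np.sin(2 * np.pi * (X_test + grid))
mse = np.mean((predict(X, C, X_test) - Y_test) ** 2)
baseline = np.mean(Y_test ** 2)   # error of the zero predictor
print(f"test MSE {mse:.4f} vs zero-predictor MSE {baseline:.4f}")
```

Storing one coefficient vector per sample keeps the iterate in the span of the kernel sections seen so far, so each step costs time linear in the number of past samples. The finite-horizon variant would replace the two decaying schedules with constants.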

Core claim

Under suitable structural and distributional assumptions on the target operator and the data-generating process, regularized stochastic gradient descent algorithms applied to the vector-valued RKHS induced by an operator-valued kernel produce dimension-independent bounds on prediction and estimation errors. The resulting rates are near-optimal in expectation for both online and finite-horizon settings, and high-probability estimates are derived that imply almost sure convergence in infinite-dimensional output spaces.

What carries the argument

Regularized stochastic gradient descent iterates on the vector-valued reproducing kernel Hilbert space induced by an operator-valued kernel, which regularizes the ill-posed inverse problem of operator estimation.
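For orientation, a standard form of this recursion is sketched below. The notation (the kernel section $K_x$, the schedules $\eta_t$ and $\lambda_t$, the exponents $\theta$ and $\gamma$, the horizon $T$, and the zero initialization) is assumed here for illustration and may differ from the paper's.

```latex
% Assumed notation: K_x y := K(\cdot, x) y for y in the output space \mathcal{Y},
% so the reproducing property reads
% \langle f, K_x y \rangle_{\mathcal{H}_K} = \langle f(x), y \rangle_{\mathcal{Y}}.
\begin{align*}
  f_{t+1} &= f_t - \eta_t \Big( K_{x_t}\big(f_t(x_t) - y_t\big) + \lambda_t f_t \Big),
  \qquad f_1 = 0, \\
  \text{online:} \quad & \eta_t = \eta_1 t^{-\theta}, \quad \lambda_t = \lambda_1 t^{-\gamma}, \\
  \text{finite horizon:} \quad & \eta_t \equiv \eta, \quad \lambda_t \equiv \lambda
  \quad \text{for } t = 1, \dots, T .
\end{align*}
```

The bracketed term is the gradient in $f$ of the regularized pointwise loss $\tfrac{1}{2}\|f(x_t) - y_t\|_{\mathcal{Y}}^2 + \tfrac{\lambda_t}{2}\|f\|_{\mathcal{H}_K}^2$, which is why the regularization shows up as a shrinkage of the current iterate.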

If this is right

  • The method yields near-optimal rates without explicit dependence on output dimension, enabling use in high- or infinite-dimensional output settings.
  • High-probability bounds guarantee almost sure convergence of the learned operator iterates.
  • The framework directly applies to structured prediction tasks where outputs are elements of a Hilbert space.
  • Concrete examples show the approach extends to learning solution operators for parametric partial differential equations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The general technique for high-probability guarantees in infinite dimensions could be adapted to other kernel methods that operate on function-valued data.
  • Numerical experiments on function-valued regression problems with growing output dimension would provide direct checks on whether the predicted dimension independence appears in practice.
  • The same regularization and step-size schedules might transfer to related stochastic approximation schemes for operator equations arising in control or inverse problems.
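The numerical check suggested in the second item could start from a sketch like the one below. It is an illustration under stated assumptions rather than an experiment from the paper: the kernel is the separable choice k(x, x')·I with a Gaussian k, the outputs are shifted sine functions sampled on a grid of d points, and the fixed step size and regularization (`eta`, `lam`) are hypothetical values.

```python
import numpy as np

def gram(A, B, bandwidth=0.2):
    """Gaussian Gram matrix between 1-D input arrays A and B."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2.0 * bandwidth ** 2))

def fixed_parameter_sgd(K, Y, eta, lam):
    """Finite-horizon regularized SGD with constant step size and regularization.

    K is the (n, n) Gram matrix and Y the (n, d) outputs; returns coefficients C
    such that the final iterate is f = sum_i k(., x_i) * C_i.
    """
    n, d = Y.shape
    C = np.zeros((n, d))
    for t in range(n):
        residual = K[t, :t] @ C[:t] - Y[t]   # f_t(x_t) - y_t
        C[:t] *= 1.0 - eta * lam             # shrinkage from the regularization term
        C[t] = -eta * residual               # new kernel section at x_t
    return C

def shifted_sines(x, grid):
    """Target map x -> (sin(2*pi*(x + s)))_{s in grid}, one row per input."""
    return np.sin(2 * np.pi * (x[:, None] + grid[None, :]))

def relative_error(d, n=400, n_test=200, noise=0.1, seed=0):
    """Squared test error divided by the squared norm of the noiseless targets."""
    rng = np.random.default_rng(seed)        # same seed, so inputs match across d
    grid = np.arange(d) / d
    x_train, x_test = rng.uniform(size=n), rng.uniform(size=n_test)
    Y = shifted_sines(x_train, grid) + noise * rng.standard_normal((n, d))
    C = fixed_parameter_sgd(gram(x_train, x_train), Y, eta=0.3, lam=0.01)
    pred = gram(x_test, x_train) @ C
    truth = shifted_sines(x_test, grid)
    return np.sum((pred - truth) ** 2) / np.sum(truth ** 2)

errors = {d: relative_error(d) for d in (2, 32, 512)}
for d, err in errors.items():
    print(f"d = {d:4d}  relative error = {err:.4f}")
```

With a separable kernel the output coordinates decouple, so a roughly flat relative error across d is expected by construction. This makes the sketch a sanity check of the setup; a sharper test would use a non-separable kernel or noise that is correlated across the output grid.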

Load-bearing premise

The target operator and the data-generating process satisfy structural and distributional assumptions that keep the problem well-behaved enough for dimension-free error control.

What would settle it

Construct a data distribution and target operator that satisfy the paper's stated assumptions yet yield a prediction error that grows with the dimension of the output Hilbert space. If the observed error scales with dimension under those assumptions, the dimension-independence claim fails.

Figures

Figures reproduced from arXiv: 2504.18184 by Jia-Qi Yang, Lei Shi.

Figure 1: Surrogate approach for structured prediction
Figure 2: Commutative diagram of PCA encoder-decoder framework

Original abstract

We consider a class of statistical inverse problems involving the estimation of a regression operator from a Polish space to a separable Hilbert space, where the target lies in a vector-valued reproducing kernel Hilbert space induced by an operator-valued kernel. To address the associated ill-posedness, we analyze regularized stochastic gradient descent (SGD) algorithms in both online and finite-horizon settings. The former uses polynomially decaying step sizes and regularization parameters, while the latter adopts fixed values. Under suitable structural and distributional assumptions, we establish dimension-independent bounds for prediction and estimation errors. The resulting convergence rates are near-optimal in expectation, and we also derive high-probability estimates that imply almost sure convergence. Our analysis introduces a general technique for obtaining high-probability guarantees in infinite-dimensional settings. We illustrate the practical scope of our framework with applications to structured prediction and parametric PDEs, providing examples that reflect how the approach can be applied in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes regularized stochastic gradient descent (SGD) for estimating a regression operator from a Polish space to a separable Hilbert space, where the target lies in a vector-valued RKHS induced by an operator-valued kernel. It considers both online (polynomially decaying step sizes and regularization) and finite-horizon (fixed parameters) settings. Under structural and distributional assumptions (source condition, noise moments), the authors derive dimension-independent bounds on prediction and estimation errors, establish near-optimal rates in expectation, and provide high-probability estimates implying almost-sure convergence via a general technique for infinite-dimensional settings. Illustrations are given for structured prediction and parametric PDEs.

Significance. If the central claims hold, the work would advance the theoretical analysis of operator learning in infinite-dimensional spaces by supplying convergence guarantees for SGD that remain dimension-independent. The proposed general technique for high-probability bounds in separable Hilbert spaces could serve as a template for related statistical inverse problems. The applications to structured prediction and PDEs demonstrate practical relevance, though the primary contribution is the theoretical development of error bounds.

major comments (2)
  1. [§4, High-probability analysis] The martingale concentration step used to obtain high-probability operator-norm bounds must be verified to apply directly to general operator-valued kernels without implicit finite-rank or trace-class restrictions. If the argument reduces the process via a fixed test functional or employs a chaining argument whose covering numbers depend on the effective dimension of the RKHS, the claimed dimension-independence may fail for kernels with slowly decaying eigenvalues; the structural assumptions listed do not explicitly preclude this.
  2. [Theorem 3.1 and Corollary 3.2, expectation bounds] The near-optimality claim for the convergence rates in expectation relies on the specific choice of polynomially decaying step sizes; it is unclear whether the constants remain uniform when the source condition parameter and noise moments vary simultaneously, which could affect the dimension-free character of the final rates.
minor comments (2)
  1. [§2] Notation for the operator-valued kernel and the associated RKHS should be introduced with an explicit reference to the reproducing property in the vector-valued case to avoid ambiguity when passing between scalar and operator settings.
  2. [§3.2] The finite-horizon setting would benefit from a short remark clarifying how the fixed regularization parameter interacts with the horizon length to maintain the claimed rates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and indicate where clarifications will be incorporated in the revision.

  1. Referee: [§4] §4 (High-probability analysis): The martingale concentration step used to obtain high-probability operator-norm bounds must be verified to apply directly to general operator-valued kernels without implicit finite-rank or trace-class restrictions. If the argument reduces the process via a fixed test functional or employs a chaining argument whose covering numbers depend on the effective dimension of the RKHS, the claimed dimension-independence may fail for kernels with slowly decaying eigenvalues; the structural assumptions listed do not explicitly preclude this.

    Authors: We appreciate the referee's concern. The high-probability bounds in Section 4 are obtained via a general martingale concentration inequality for processes taking values in separable Hilbert spaces (invoking a vector-valued version of Freedman's inequality or equivalent results that hold without finite-rank or trace-class assumptions). The argument bounds the operator norm directly using the separability of the codomain and the uniform boundedness of the operator-valued kernel; it does not reduce the process to a fixed test functional nor employ chaining whose covering numbers depend on the effective dimension of the RKHS. The source condition together with the moment assumptions on the noise control the variance terms uniformly, so that the final rates remain dimension-independent even when the kernel eigenvalues decay slowly. To make this generality explicit, we will add a short remark after the statement of the concentration lemma.

    revision: partial

  2. Referee: [Theorem 3.1] Theorem 3.1 and Corollary 3.2 (expectation bounds): The near-optimality claim for the convergence rates in expectation relies on the specific choice of polynomially decaying step sizes; it is unclear whether the constants remain uniform when the source condition parameter and noise moments vary simultaneously, which could affect the dimension-free character of the final rates.

    Authors: The near-optimality statements in Theorem 3.1 and Corollary 3.2 are with respect to the minimax rates that are known to depend on the source-condition index and the noise-moment order. The polynomial schedules for the step size and regularization parameter are chosen precisely to attain these rates. The multiplicative constants appearing in the bounds are explicit functions of those parameters (as well as the kernel bound and the initial error); they are therefore not claimed to be uniform over all possible source indices and noise moments. The dimension-free character of the rates refers exclusively to the absence of any dependence on the dimension of the input Polish space or the output Hilbert space, which is preserved regardless of the values taken by the source and noise parameters. We will add a clarifying paragraph in the discussion following Corollary 3.2 that makes this dependence explicit and reiterates that dimension independence is unaffected.

    revision: partial

Circularity Check

0 steps flagged

Theoretical derivation of dimension-independent bounds from structural assumptions; minor self-citation present but not load-bearing for central claims.

The paper derives prediction and estimation error bounds for regularized SGD with operator-valued kernels under explicit structural and distributional assumptions (source conditions, noise moments, step-size schedules). These bounds are obtained via standard concentration and martingale arguments in separable Hilbert spaces rather than by fitting parameters to data or reducing predictions to inputs by construction. No self-definitional loops, fitted-input predictions, or uniqueness theorems imported from the authors' prior work appear in the derivation chain. The high-probability estimates are presented as a general technique for infinite-dimensional settings, but the analysis remains self-contained against external benchmarks once the listed assumptions are granted. A low score of 2 accounts for possible routine self-citations that do not carry the central claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claims rest on standard but unspecified structural and distributional assumptions typical of statistical learning in Hilbert spaces; no free parameters are explicitly fitted to data in the abstract description.

free parameters (1)
  • step sizes and regularization parameters
    Polynomially decaying or fixed values chosen for the online and finite-horizon algorithms to achieve the stated rates.
axioms (1)
  • domain assumption: Suitable structural and distributional assumptions on the regression operator and data distribution.
    Invoked to establish dimension-independent bounds and convergence rates.

pith-pipeline@v0.9.0 · 5695 in / 1136 out tokens · 101790 ms · 2026-05-22T18:18:58.017475+00:00 · methodology
