Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks

Junyu Zhou; Marius Kloft; Philipp Liznerski; Puyu Wang

arxiv: 2601.22409 · v3 · pith:WMNEPMVCnew · submitted 2026-01-29 · 💻 cs.LG · cs.AI· stat.ML

Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks

Puyu Wang , Junyu Zhou , Philipp Liznerski , Marius Kloft This is my paper

Pith reviewed 2026-05-16 09:31 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords Kolmogorov-Arnold Networksgradient descentdifferential privacygeneralization boundsoptimization ratesnetwork widthlogistic losstwo-layer networks

0 comments

The pith

Gradient descent on two-layer Kolmogorov-Arnold networks reaches optimization and generalization rates of order 1/T and 1/n with polylogarithmic width, and differential privacy makes that width necessary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes gradient descent for two-layer Kolmogorov-Arnold networks and derives bounds that characterize training dynamics, generalization, and utility under differential privacy. Under an NTK-separable assumption with logistic loss, polylogarithmic network width suffices to obtain an optimization rate of order 1/T and a generalization rate of order 1/n. In the private setting the analysis produces a utility bound of order sqrt(d) over n epsilon that matches the classical lower bound for convex Lipschitz problems. This reveals a gap: polylogarithmic width is sufficient without privacy but both sufficient and necessary once differential privacy is imposed.

Core claim

For two-layer KANs trained by gradient descent under the NTK-separable assumption on logistic loss, polylogarithmic network width suffices to achieve an optimization rate of order 1/T after T iterations and a generalization rate of order 1/n for n samples. In the differentially private setting the noise required for (epsilon, delta)-DP yields a utility bound of order sqrt(d)/(n epsilon), matching the lower bound for general convex Lipschitz problems and implying that polylogarithmic width is necessary as well as sufficient under differential privacy.

What carries the argument

Gradient descent dynamics on two-layer KANs specialized under the NTK-separable assumption for logistic loss, which controls the optimization and generalization rates and establishes necessity of polylogarithmic width under differential privacy.

If this is right

Polylogarithmic width suffices for GD to converge at rate 1/T.
Generalization error scales as 1/n with sample size.
Private utility matches the classical sqrt(d) lower bound for convex Lipschitz problems.
Polylogarithmic width becomes necessary once differential privacy is required.
The derived rates can guide practical choices of network width and early stopping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The private versus non-private width gap may appear in other network families when similar assumptions hold.
Practitioners could set minimal widths for private training directly from the polylogarithmic threshold.
The tightness to classical lower bounds suggests the analysis captures the essential cost of privacy for this architecture.
Width selection informed by these bounds may improve trade-offs between privacy and accuracy in deployed models.

Load-bearing premise

The training data must satisfy an NTK-separable assumption under logistic loss so the general bounds specialize to the fast rates and the privacy necessity result.

What would settle it

Observing that a network with sub-polylogarithmic width achieves the claimed 1/T optimization rate on NTK-separable logistic-loss data, or that private training with such narrow networks exceeds the sqrt(d)/(n epsilon) utility bound, would contradict the sufficiency and necessity claims.

Figures

Figures reproduced from arXiv: 2601.22409 by Junyu Zhou, Marius Kloft, Philipp Liznerski, Puyu Wang.

**Figure 2.** Figure 2: (a) Training and test accuracy as a function of the width m (with T fixed) and the number of iterations T (with m fixed) for GD, and (b) private utility as a function of m and T for DP-GD, on synthetic logistic data (top row) and MNIST (bottom row). In (a), the vertical dashed lines indicate empirically observed change points beyond which increasing the width m yields diminishing or flat training and test … view at source ↗

**Figure 3.** Figure 3: Training and test losses versus m and T. E.3 Experiments The paper shows results for two types of experiments. Loss over m. Here, we vary the model width m and report the resulting training and test metrics of the model for each 50 random seeds on the synthetic data and 20 random seeds on MNIST. Loss over T. Here, we vary the full training dataset iterations T and report the resulting training and test met… view at source ↗

read the original abstract

Kolmogorov--Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, with $T$ denoting the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\epsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\epsilon)$ (with $d$ the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between non-private (sufficiency only) and private (necessity also emerges) training regimes. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends GD analysis to KANs and gets clean polylog-width rates plus a DP utility bound, but both the fast rates and the necessity claim route through the NTK-separable assumption.

read the letter

The main point is that this work takes existing gradient descent theory and specializes it to two-layer KANs. Under logistic loss and the NTK-separable assumption they obtain 1/T optimization and 1/n generalization with only polylog width, then show a private utility bound of order sqrt(d)/(n epsilon) that matches the classical convex lower bound. They also argue that polylog width is necessary once privacy is imposed, which creates a qualitative difference from the non-private case. The experiments give some practical pointers on width choice and early stopping, which is useful even if the theory is the core contribution. What they do cleanly is first state general bounds before specializing, so the reader can see where the assumption enters. The DP matching lower bound is a nice concrete result. The soft spot is exactly the one flagged in the stress-test note. The claimed rates and the necessity statement both depend on the NTK-separable assumption for logistic loss; without it the paper does not deliver the stated exponents. The necessity claim under DP does not appear to be shown independently of that modeling choice, so the gap between private and non-private regimes is narrower than the abstract suggests if the assumption fails on typical data. No machine-checked proofs or parameter-free derivations are referenced, which leaves the tightness of the bounds harder to assess. This paper is for readers who follow theoretical work on KANs or privacy-aware training of non-standard architectures. A colleague who wants concrete bounds to compare against MLP results will get something out of it. It deserves peer review because the topic is timely and the results are stated sharply enough to be checked and revised.

Referee Report

3 major / 1 minor

Summary. The paper analyzes gradient descent training of two-layer Kolmogorov-Arnold Networks, deriving general bounds on optimization dynamics, generalization, and differential privacy utility. Specializing to logistic loss under an NTK-separable assumption, it claims that polylogarithmic width suffices to achieve O(1/T) optimization and O(1/n) generalization rates. In the private setting it derives a utility bound of order √d/(nε) that matches the classical convex Lipschitz lower bound and argues that polylogarithmic width is also necessary under DP, revealing a qualitative gap between the private and non-private regimes. Experiments are used to illustrate practical guidance on width selection and early stopping.

Significance. If the results hold, the work supplies the first systematic theoretical treatment of optimization, generalization, and privacy for KANs under gradient descent. The matching of the classical DP lower bound and the claimed distinction in width requirements between private and non-private training would be useful for guiding architecture design in this emerging model family. The dependence on the NTK-separable assumption, however, restricts the immediate scope of the rates and the necessity claim.

major comments (3)

[Abstract] Abstract: the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing the general analysis to logistic loss under the NTK-separable assumption; the manuscript does not show that these rates follow from the general bounds without this assumption, making the assumption load-bearing for the central sufficiency claims.
[Abstract] Abstract: the necessity of polylogarithmic width under (ε,δ)-DP is asserted, yet the argument appears to route through the same NTK-separable specialization used for sufficiency; without an independent argument or a demonstration that necessity holds for general convex Lipschitz problems, the claimed qualitative gap between private and non-private regimes is not fully supported.
[Abstract] Abstract: the utility bound of order √d/(nε) is stated to match the classical lower bound, but the derivation of the noise scale required for (ε,δ)-DP and the precise dependence on network width are not exhibited in the provided text, preventing verification that the bound remains tight once the NTK-separable assumption is imposed.

minor comments (1)

The abstract refers to 'two-layer KANs' without specifying the precise form of the univariate functions or the spline basis used; a brief clarification would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments help us clarify the precise scope of our results and the role of the NTK-separable assumption. We address each major comment below and will revise the abstract and introduction for greater precision.

read point-by-point responses

Referee: [Abstract] Abstract: the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing the general analysis to logistic loss under the NTK-separable assumption; the manuscript does not show that these rates follow from the general bounds without this assumption, making the assumption load-bearing for the central sufficiency claims.

Authors: We agree that the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing to logistic loss under the NTK-separable assumption. The general bounds derived earlier in the paper apply to two-layer KANs without this assumption but yield weaker rates that depend on additional problem parameters. The specialization is explicitly stated in the abstract and is load-bearing for the tight rates; we will revise the abstract to emphasize this distinction more clearly and add a sentence noting that the general bounds serve as the foundation from which the specialized rates are derived. revision: yes
Referee: [Abstract] Abstract: the necessity of polylogarithmic width under (ε,δ)-DP is asserted, yet the argument appears to route through the same NTK-separable specialization used for sufficiency; without an independent argument or a demonstration that necessity holds for general convex Lipschitz problems, the claimed qualitative gap between private and non-private regimes is not fully supported.

Authors: The necessity claim for polylogarithmic width is established within the same NTK-separable logistic-loss setting used for the sufficiency results. We do not claim or prove necessity for arbitrary convex Lipschitz problems; our focus is on the KAN model class. The qualitative gap we highlight is therefore internal to this setting: polylogarithmic width suffices for the non-private rates, while the same width becomes necessary to achieve the stated utility under DP. We will add a clarifying sentence in the abstract and introduction to restrict the necessity statement to the specialized regime and briefly discuss why the DP analysis forces the width lower bound in this case. revision: partial
Referee: [Abstract] Abstract: the utility bound of order √d/(nε) is stated to match the classical lower bound, but the derivation of the noise scale required for (ε,δ)-DP and the precise dependence on network width are not exhibited in the provided text, preventing verification that the bound remains tight once the NTK-separable assumption is imposed.

Authors: The noise scale for (ε,δ)-DP is derived in Section 4 by applying the Gaussian mechanism to the per-iteration gradients, with sensitivity bounded using the NTK feature map under the separability assumption. The resulting utility bound of order √d/(nε) is obtained after substituting the polylogarithmic width that suffices for optimization; the bound is independent of width once this threshold is met and matches the classical convex Lipschitz lower bound. The full derivation, including the width dependence in the sensitivity, appears in the main text. We will add a parenthetical reference to Section 4 in the abstract to facilitate verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from explicit NTK-separable assumption and external classical bounds

full rationale

The paper first states general bounds on GD dynamics, generalization, and DP utility for two-layer KANs. It then specializes these bounds to logistic loss under the explicitly declared NTK-separable assumption to obtain the 1/T optimization and 1/n generalization rates with polylog width. The private utility bound of order √d/(nε) is obtained by adding Gaussian noise calibrated to the sensitivity and is shown to match the known classical lower bound for convex Lipschitz problems (an external result, not derived inside the paper). The necessity claim for polylog width under DP likewise rests on this matching to the external convex lower bound rather than on any self-referential construction or fitted parameter. No equation reduces to a prior equation by definition, no parameter is fitted on a data subset and then relabeled a prediction, and no load-bearing step relies on a self-citation whose content is itself unverified. The analysis is therefore self-contained once the assumption is granted, yielding a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the NTK-separable assumption as a key domain assumption. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption NTK-separable assumption on the data distribution for logistic loss
This assumption allows the specialization of the general bounds to achieve the stated optimization and generalization rates with polylog width.

pith-pipeline@v0.9.0 · 5549 in / 1372 out tokens · 37851 ms · 2026-05-16T09:31:00.048163+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages

[1]

Deep learning with differential privacy

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016

work page 2016
[2]

A convergence theory for deep learning via over- parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

work page 2019
[3]

On functions of three variables

Vladimir Igorevich Arnol’d. On functions of three variables. InDoklady Akademii Nauk, volume 114, pages 679–681. Russian Academy of Sciences, 1957

work page 1957
[4]

Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

work page 2019
[5]

Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

work page 2017
[6]

Private stochastic convex optimization with optimal rates

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. InAdvances in neural information processing systems, volume 32, 2019

work page 2019
[7]

Differentially private stochastic optimization: New results in convex and non-convex settings

Raef Bassily, Crist´ obal Guzm´ an, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. InAdvances in Neural Information Processing Systems, volume 34, pages 9317–9329, 2021

work page 2021
[8]

Convolutional kolmogorov-arnold networks.arXiv preprint arXiv:2406.13155, 2024

Alexander Dylan Bodner, Antonio Santiago Tepsich, Jack Natan Spolski, and Santiago Pourteau. Convolutional kolmogorov-arnold networks.arXiv preprint arXiv:2406.13155, 2024

work page arXiv 2024
[9]

On a constructive proof of kolmogorov’s superposition theorem

J¨ urgen Braun and Michael Griebel. On a constructive proof of kolmogorov’s superposition theorem. Constructive approximation, 30(3):653–675, 2009

work page 2009
[10]

On the convergence and calibration of deep learning with differential privacy.Transactions on machine learning research, 2023:https–openreview, 2023

Zhiqi Bu, Hua Wang, Zongyu Dai, and Qi Long. On the convergence and calibration of deep learning with differential privacy.Transactions on machine learning research, 2023:https–openreview, 2023

work page 2023
[11]

Generalization bounds of stochastic gradient descent for wide and deep neural networks

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

work page 2019
[12]

How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021. 12

work page 2021
[13]

Poptsova

Oleksandr Cherednichenko and Maria S. Poptsova. Kolmogorov-arnold networks for genomic tasks. Briefings Bioinform., 26(2), 2025

work page 2025
[14]

The MNIST database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

Li Deng. The MNIST database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

work page 2012
[15]

Gradient descent finds global minima of deep neural networks

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

work page 2019
[16]

Calibrating noise to sensitivity in private data analysis

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. InTheory of cryptography conference, pages 265–284. Springer, 2006

work page 2006
[17]

The algorithmic foundations of differential privacy.Foundations and Trends®in Theoretical Computer Science, 9(3–4):211–407, 2014

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy.Foundations and Trends®in Theoretical Computer Science, 9(3–4):211–407, 2014

work page 2014
[18]

Kan we improve on hep classification tasks? kolmogorov–arnold networks applied to an lhc physics example.Computing and Software for Big Science, 9(1):9, 2025

Johannes Erdmann, Florian Mausolf, and Jan Lukas Sp¨ ah. Kan we improve on hep classification tasks? kolmogorov–arnold networks applied to an lhc physics example.Computing and Software for Big Science, 9(1):9, 2025

work page 2025
[19]

Kanice: Kolmogorov-arnold networks with interactive convolutional elements

Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, David Dobson, Kendall N Niles, Ken Pathak, and Steven Sloan. Kanice: Kolmogorov-arnold networks with interactive convolutional elements. In Proceedings of the 4th International Conference on AI-ML Systems, pages 1–10, 2024

work page 2024
[20]

Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

work page 2023
[21]

On the convergence of (stochastic) gradient descent for kolmogorov– arnold networks.IEEE Transactions on Information Theory, 2025

Yihang Gao and Vincent YF Tan. On the convergence of (stochastic) gradient descent for kolmogorov– arnold networks.IEEE Transactions on Information Theory, 2025

work page 2025
[22]

A temporal Kolmogorov-Arnold tran sformer for time series forecasting,

Remi Genet and Hugo Inzirillo. A temporal kolmogorov-arnold transformer for time series forecasting. arXiv preprint arXiv:2406.02486, 2024

work page arXiv 2024
[23]

Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[24]

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

work page 2020
[25]

Andre˘ ı Nikolaevich Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.Translations American Mathematical Society, 2(28):55–59, 1963

work page 1963
[26]

Stability and generalization analysis of gradient methods for shallow neural networks

Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

work page 2022
[27]

Fine-grained analysis of stability and generalization for stochastic gradient descent

Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. InInternational Conference on Machine Learning, pages 5809–5819, 2020

work page 2020
[28]

Kolmogorov–arnold graph neural networks for molecular property prediction.Nature Machine Intelligence, 7:1346–1354, 08 2025

Longlong Li, Yipeng Zhang, Guanghui Wang, and Kelin Xia. Kolmogorov–arnold graph neural networks for molecular property prediction.Nature Machine Intelligence, 7:1346–1354, 08 2025

work page 2025
[29]

Generalization bounds for kolmogorov- arnold networks (kans) and enhanced kans with lower lipschitz complexity

Pengqi Li, Lizhong Ding, Jiarun Fu, Guoren Wang, Ye Yuan, et al. Generalization bounds for kolmogorov- arnold networks (kans) and enhanced kans with lower lipschitz complexity. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[30]

On the rate of convergence of K olmogorov-Arnold Network regression estimators,

Wei Liu, Eleni Chatzi, and Zhilu Lai. On the rate of convergence of kolmogorov-arnold network regression estimators.arXiv preprint arXiv:2509.19830, 2025. 13

work page arXiv 2025
[31]

Kan: Kolmogorov-arnold networks

Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljaˇ ci´ c, Thomas Y Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks. InInternational Conference on Learning Representations, 2025

work page 2025
[32]

How to make the gradients small privately: Improved rates for differentially private non-convex optimization

Andrew Lowy, Jonathan Ullman, and Stephen J Wright. How to make the gradients small privately: Improved rates for differentially private non-convex optimization. InInternational Conference on Machine Learning, 2024

work page 2024
[33]

R´ enyi differential privacy

Ilya Mironov. R´ enyi differential privacy. In2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017

work page 2017
[34]

How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

Mike Nguyen and Nicole M¨ ucke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

work page 2024
[35]

Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

work page arXiv 1905
[36]

Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021

work page 2021
[37]

Overparameterized nonlinear learning: Gradient descent takes the shortest path? InInternational Conference on Machine Learning, pages 4951–4960

Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? InInternational Conference on Machine Learning, pages 4951–4960. PMLR, 2019

work page 2019
[38]

Physics informed kolmogorov-arnold neural networks for dynamical analysis via efficient-kan and wav-kan.Journal of Machine Learning Research, 26(233):1–39, 2025

Subhajit Patra, Sonali Panda, Bikram Keshari Parida, Mahima Arya, Kurt Jacobs, Denys I Bondar, and Abhijit Sen. Physics informed kolmogorov-arnold neural networks for dynamical analysis via efficient-kan and wav-kan.Journal of Machine Learning Research, 26(233):1–39, 2025

work page 2025
[39]

Finding local diffusion schrodinger bridge using kolmogorov-arnold network

Xingyu Qiu, Mengying Yang, Xinghua Ma, Fanding Li, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, and Shuo Li. Finding local diffusion schrodinger bridge using kolmogorov-arnold network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23227–23236, 2025

work page 2025
[40]

GI NN-KAN: Interpretability pipelining with applications in physics-informed neural networks,

Nisal Ranasinghe, Yu Xia, Sachith Seneviratne, and Saman Halgamuge. Ginn-kan: Interpretability pipelining with applications in physics informed neural networks.arXiv preprint arXiv:2408.14780, 2024

work page arXiv 2024
[41]

Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel

Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. InAdvances in Neural Information Processing Systems, volume 34. PMLR, 2021

work page 2021
[42]

Convex approximation of two-layer relu networks for hidden state differential privacy

Rob Romijnders and Antti Koskela. Convex approximation of two-layer relu networks for hidden state differential privacy. InAdvances in Neural Information Processing Systems, 2025

work page 2025
[43]

Stability vs implicit bias of gradient methods on separable data and beyond

Matan Schliserman and Tomer Koren. Stability vs implicit bias of gradient methods on separable data and beyond. InConference on Learning Theory, pages 3380–3394. PMLR, 2022

work page 2022
[44]

Towards understanding generalization in dp-gd: A case study in training two-layer cnns

Zhongjie Shi, Puyu Wang, Chenyang Zhang, and Yuan Cao. Towards understanding generalization in dp-gd: A case study in training two-layer cnns. InThe Fortieth AAAI Conference on Artificial Intelligence, 2026

work page 2026
[45]

Khemraj Shukla, Juan Diego Toscano, Zhicheng Wang, Zongren Zou, and George Em Karniadakis. A comprehensive and fair comparison between mlp and kan representations for differential equations and operator networks.Computer Methods in Applied Mechanics and Engineering, 431:117290, 2024

work page 2024
[46]

A survey on kolmogorov-arnold network.ACM Computing Surveys, 58(2):1–35, 2025

Shriyank Somvanshi, Syed Aaqib Javed, Md Monzurul Islam, Diwas Pandit, and Subasish Das. A survey on kolmogorov-arnold network.ACM Computing Surveys, 58(2):1–35, 2025

work page 2025
[47]

Stochastic gradient descent with differentially private updates

Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In2013 IEEE global conference on signal and information processing, pages 245–248. IEEE, 2013. 14

work page 2013
[48]

Kan-ddpm: Kolmogorov-arnold networks with diffusion denoising probabilistic models for mri-to-ct synthesis

Vanessa Su and Xiaofeng Yang. Kan-ddpm: Kolmogorov-arnold networks with diffusion denoising probabilistic models for mri-to-ct synthesis. InMedical Imaging 2025: Imaging Informatics, volume 13411, pages 227–233. SPIE, 2025

work page 2025
[49]

Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

work page 2024
[50]

Sharper guarantees for learning neural network classifiers with gradient methods

Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. InInternational Conference on Learning Representations, 2025

work page 2025
[51]

Kolmogorov- arnold networks (kans) for time series analysis,

Cristian J Vaca-Rubio, Luis Blanco, Roberto Pereira, and M` arius Caus. Kolmogorov-arnold networks (kans) for time series analysis.arXiv preprint arXiv:2405.08790, 2024

work page arXiv 2024
[52]

Cambridge university press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

work page 2018
[53]

Cambridge university press, 2019

Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

work page 2019
[54]

Differentially private empirical risk minimization with non-convex loss functions

Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. InInternational Conference on Machine Learning, pages 6526–6535. PMLR, 2019

work page 2019
[55]

Optimal utility bounds for differentially private gradient descent in three-layer neural networks

Puyu Wang, Yunwen Lei, Marius Kloft, and Yiming Ying. Optimal utility bounds for differentially private gradient descent in three-layer neural networks. In2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–8. IEEE, 2025

work page 2025
[56]

Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

work page 2025
[57]

On the expressiveness and spectral bias of kans

Yixuan Wang, Jonathan W Siegel, Ziming Liu, and Thomas Y Hou. On the expressiveness and spectral bias of kans. InInternational Conference on Learning Representations, 2025

work page 2025
[58]

Kolmogorov–arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on kolmogorov–arnold networks

Yizheng Wang, Jia Sun, Jinshuai Bai, Cosmin Anitescu, Mohammad Sadegh Eshaghi, Xiaoying Zhuang, Timon Rabczuk, and Yinghua Liu. Kolmogorov–arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on kolmogorov–arnold networks. Computer Methods in Applied Mechanics and Engineering, 433:117518, 2025

work page 2025
[59]

A conditional kan diffusion network for human activity recognition with missing sensor signal series

Hao Xiong, Jiayi Gong, Haiyong Luo, Fang Zhao, Yang Gao, Runze Chen, and Mingyu Xiao. A conditional kan diffusion network for human activity recognition with missing sensor signal series. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025
[60]

Efficient private erm for smooth objectives

Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private erm for smooth objectives. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3922–3928, 2017

work page 2017
[61]

Generalization bounds and model complexity for kolmogorov-arnold networks

Xianyang Zhang and Huijuan Zhou. Generalization bounds and model complexity for kolmogorov-arnold networks. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[62]

with probability at least 1−δ

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020. 15 Appendix A Useful Lemmas In this section, we introduce several useful lemmas that will be used in the proofs. Lemma A.1(Concentration of the norm of Sub-Gaussian vectors [53]).Letu ∈R s be a centered ...

work page 2020
[63]

λmin ∇2LS(Θα,t) ≥ − cmax√m LS(Θα,t).(19) wherec max =C σ,b p 3 2 p log( m δ ) + √p+ 3 Θ(0)−Θ ∗ 2

Then, from Proposition B.2 we know the least eigenvalue of∇ 2LS(Θα,t) can be controlled as follows. λmin ∇2LS(Θα,t) ≥ − cmax√m LS(Θα,t).(19) wherec max =C σ,b p 3 2 p log( m δ ) + √p+ 3 Θ(0)−Θ ∗ 2 . Then from Taylor’s theorem and (19), we know that there exists a α∈ [0, 1] and the associated Θ α,t such that with probability at least 1−δ, it holds that LS(...

work page
[64]

Then, the above inequality can be further bounded as follows λmax ∇2LS(Θ(t+ 1)) ≤C σ,b p2 p+ B2 b p m ∥Θ(0)−Θ ∗∥4 2 ≤C σ,bp3 =:ρ ℓ, where the last inequality used the condition m≳∥ Θ(0) − Θ∗∥4

work page
[65]

with probability at least 1 −δ

Then we know ℓ(Θ(t + 1)) is ρℓ-strongly smooth through the trajectory of {Θ(t)}t∈[k]. From the descent lemma (see e.g., [ 49]), we know if η≤ 1/ρℓ, it holds for anyt∈[k] that LS(Θ(t+ 1))≤ L S(Θ(t))− η 2 ∇LS(Θ(t)) 2 2.(21) Combining the above inequality and (20), we know LS(Θ(t+ 1)) ≤ L S(Θ∗)− ∇LS(Θ(t)),Θ ∗ −Θ(t) − η 2 ∇LS(Θ(t)) 2 2 + cmax 2√m max α∈[0,1] ...

work page
[66]

with probability at least 1−δ

Then, the conditions m≳ (log(m/δ) + g2(1/T ))g4(1/T ) and η≲min g2(1), g2(1) LS(Θ(0)) −1 ensure that all the conditions in Theorem 4.3 are satisfied. Applying Theorem 4.3 with Θ ∗ = Θ1/T implies that LS(Θ(T))≲ 2ηTL S(Θ1/T ) + ΛΘ1/T ηT ≤ 2η+g 2 1 T ηT . This completes the proof of the Theorem. 24 The proof of Theorem 4.9 is given as follows. Proof of Theor...

work page
[67]

Then, it holds that ES L(Θ(T))− L S(Θ(T)) ≤C σ,bp2 p+p∥Θ ∗ −Θ(0)∥ 2 2 η n ES h TX t=0 LS Θ(t) i

max ∥Θ(0)−Θ ∗∥2 2,∥Θ(0)−Θ ∗∥4 2 . Then, it holds that ES L(Θ(T))− L S(Θ(T)) ≤C σ,bp2 p+p∥Θ ∗ −Θ(0)∥ 2 2 η n ES h TX t=0 LS Θ(t) i . 26 Proof. For any i∈ [n], let {Θi(k)} and {Θ−i(k)} be produced by GD based Si and S−i, respectively. Observe that the condition ∥Θ(0) − Θ∗∥2 2 ≥ 8 max{ηT LS(Θ∗) + L eS(Θ∗) , η LS(Θ(0)) + L eS(Θ(0)) } implies ∥Θ(0) − Θ∗∥2 2 ≥ ...

work page
[68]

By further noting that (7) holds with probability at least 1 −δ over initialization c(0), this completes the proof of the theorem

Combining this with Theorems B.4 and C.4 yields ES L(Θ(T)) ≲ ηΛ2 Θ∗ n ES h TX t=0 LS Θ(t) i +E S LS(Θ(T)) ≲ Λ2 Θ∗ n + 1 ηT C(Θ∗), where C(Θ∗) = ES CS(Θ∗) . By further noting that (7) holds with probability at least 1 −δ over initialization c(0), this completes the proof of the theorem. Theorem C.5(Generalization under Realizability).Suppose Assumption 4.5...

work page
[69]

Note Lemma A.2 implies ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ) with probability at least 1 −δ/ 2

We begin by estimating the ℓ2-sensitivity of ∂aLS(eΘ(k)) and ∂cLS(eΘ(k)) at each iteration k. Note Lemma A.2 implies ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ) with probability at least 1 −δ/ 2. In the following proof, we assume the event {c(0) : ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ)} holds and aim to show Algorithm 1 satisfies (ϵ, δ/2)-DP. Noting thatSandS ′ differ in the fi...

work page
[70]

Combining (37) and the above equality and dividing byηyield g,Θ ∗ −Θ + ≥ 1 2η ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2

Applying this with (a,b,c) = (Θ,Θ +,Θ ∗) gives Θ−Θ +,Θ ∗ −Θ + = 1 2 ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2 . Combining (37) and the above equality and dividing byηyield g,Θ ∗ −Θ + ≥ 1 2η ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2 . Therefore, g,Θ−Θ ∗ = g,Θ−Θ + − g,Θ ∗ −Θ + ≤ g,Θ−Θ + + 1 2η ∥Θ−Θ ∗∥2 2 − ∥Θ+ −Θ ∗∥2 2 − ∥Θ−Θ +∥2 2 ≤ η 2 ∥g∥2 2 + 1 2η ∥Θ−Θ +∥2 2 + ...

work page
[71]

By further using eΛ2 Θ∗ ≥ηL S(eΘ(0)), it holds 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 4 ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dlog(2T /δ) n2ϵ2

It then follows that TX k=1 EA LS(eΘ(k)) ≤4TL S(Θ∗) + 3 2η ∥eΘ(0)−Θ ∗∥2 2 + 9mpηT 2(1 + log(2T /δ) ϵ ) 2n2ϵ C1d+C 2 + 2LS(eΘ(0)).(42) Therefore, 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 3 2ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dlog(2T /δ) n2ϵ2 + 2 T LS(eΘ(0)). By further using eΛ2 Θ∗ ≥ηL S(eΘ(0)), it holds 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 4 ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dl...

work page
[72]

These match the width conditions stated in the theorem

and m≲ (nϵ)4 p10(log( m δ )+R2)d2(ηT) 4 log2(T /δ). These match the width conditions stated in the theorem. Then, from Lemma D.11 we know the algorithmAis on average argument ∥ eΘ(0)−Θ∗∥2 2 n + mp4dη2T 2 log(T /δ) n3ϵ2 -stable. From Lemma B.1, the condition for m and the fact maxi∈[n] ∥ci(k)∥2 ≤ ∥c(k) − c(0)∥2 ≤R 2 ≤R , we know the Lipschitz constant of t...

work page
[73]

This completes the proof of the theorem

≳R 4. This completes the proof of the theorem. Theorem D.13(Restatement of Theorem 4.15).Let the sequence {eΘ(k)}T k=1 be produced by Algorithm 1 with step size η > 0. LetΘ ∗ be an reference point that obtain small training loss satisfying ∥eΘ(0) − Θ∗∥2 2 ≥ Cmax{ηT (LS(Θ∗)+ L eS(Θ∗)), η(LS(eΘ(0))+ L eS(eΘ(0)))}. If η≤min{ 1/3¯ρ,1}, m≳p 2(log( m δ )+ R2)(R...

work page 2048

[1] [1]

Deep learning with differential privacy

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016

work page 2016

[2] [2]

A convergence theory for deep learning via over- parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

work page 2019

[3] [3]

On functions of three variables

Vladimir Igorevich Arnol’d. On functions of three variables. InDoklady Akademii Nauk, volume 114, pages 679–681. Russian Academy of Sciences, 1957

work page 1957

[4] [4]

Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

work page 2019

[5] [5]

Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

work page 2017

[6] [6]

Private stochastic convex optimization with optimal rates

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. InAdvances in neural information processing systems, volume 32, 2019

work page 2019

[7] [7]

Differentially private stochastic optimization: New results in convex and non-convex settings

Raef Bassily, Crist´ obal Guzm´ an, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. InAdvances in Neural Information Processing Systems, volume 34, pages 9317–9329, 2021

work page 2021

[8] [8]

Convolutional kolmogorov-arnold networks.arXiv preprint arXiv:2406.13155, 2024

Alexander Dylan Bodner, Antonio Santiago Tepsich, Jack Natan Spolski, and Santiago Pourteau. Convolutional kolmogorov-arnold networks.arXiv preprint arXiv:2406.13155, 2024

work page arXiv 2024

[9] [9]

On a constructive proof of kolmogorov’s superposition theorem

J¨ urgen Braun and Michael Griebel. On a constructive proof of kolmogorov’s superposition theorem. Constructive approximation, 30(3):653–675, 2009

work page 2009

[10] [10]

On the convergence and calibration of deep learning with differential privacy.Transactions on machine learning research, 2023:https–openreview, 2023

Zhiqi Bu, Hua Wang, Zongyu Dai, and Qi Long. On the convergence and calibration of deep learning with differential privacy.Transactions on machine learning research, 2023:https–openreview, 2023

work page 2023

[11] [11]

Generalization bounds of stochastic gradient descent for wide and deep neural networks

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

work page 2019

[12] [12]

How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021. 12

work page 2021

[13] [13]

Poptsova

Oleksandr Cherednichenko and Maria S. Poptsova. Kolmogorov-arnold networks for genomic tasks. Briefings Bioinform., 26(2), 2025

work page 2025

[14] [14]

The MNIST database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

Li Deng. The MNIST database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

work page 2012

[15] [15]

Gradient descent finds global minima of deep neural networks

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

work page 2019

[16] [16]

Calibrating noise to sensitivity in private data analysis

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. InTheory of cryptography conference, pages 265–284. Springer, 2006

work page 2006

[17] [17]

The algorithmic foundations of differential privacy.Foundations and Trends®in Theoretical Computer Science, 9(3–4):211–407, 2014

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy.Foundations and Trends®in Theoretical Computer Science, 9(3–4):211–407, 2014

work page 2014

[18] [18]

Kan we improve on hep classification tasks? kolmogorov–arnold networks applied to an lhc physics example.Computing and Software for Big Science, 9(1):9, 2025

Johannes Erdmann, Florian Mausolf, and Jan Lukas Sp¨ ah. Kan we improve on hep classification tasks? kolmogorov–arnold networks applied to an lhc physics example.Computing and Software for Big Science, 9(1):9, 2025

work page 2025

[19] [19]

Kanice: Kolmogorov-arnold networks with interactive convolutional elements

Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, David Dobson, Kendall N Niles, Ken Pathak, and Steven Sloan. Kanice: Kolmogorov-arnold networks with interactive convolutional elements. In Proceedings of the 4th International Conference on AI-ML Systems, pages 1–10, 2024

work page 2024

[20] [20]

Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

work page 2023

[21] [21]

On the convergence of (stochastic) gradient descent for kolmogorov– arnold networks.IEEE Transactions on Information Theory, 2025

Yihang Gao and Vincent YF Tan. On the convergence of (stochastic) gradient descent for kolmogorov– arnold networks.IEEE Transactions on Information Theory, 2025

work page 2025

[22] [22]

A temporal Kolmogorov-Arnold tran sformer for time series forecasting,

Remi Genet and Hugo Inzirillo. A temporal kolmogorov-arnold transformer for time series forecasting. arXiv preprint arXiv:2406.02486, 2024

work page arXiv 2024

[23] [23]

Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[24] [24]

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

work page 2020

[25] [25]

Andre˘ ı Nikolaevich Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.Translations American Mathematical Society, 2(28):55–59, 1963

work page 1963

[26] [26]

Stability and generalization analysis of gradient methods for shallow neural networks

Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

work page 2022

[27] [27]

Fine-grained analysis of stability and generalization for stochastic gradient descent

Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. InInternational Conference on Machine Learning, pages 5809–5819, 2020

work page 2020

[28] [28]

Kolmogorov–arnold graph neural networks for molecular property prediction.Nature Machine Intelligence, 7:1346–1354, 08 2025

Longlong Li, Yipeng Zhang, Guanghui Wang, and Kelin Xia. Kolmogorov–arnold graph neural networks for molecular property prediction.Nature Machine Intelligence, 7:1346–1354, 08 2025

work page 2025

[29] [29]

Generalization bounds for kolmogorov- arnold networks (kans) and enhanced kans with lower lipschitz complexity

Pengqi Li, Lizhong Ding, Jiarun Fu, Guoren Wang, Ye Yuan, et al. Generalization bounds for kolmogorov- arnold networks (kans) and enhanced kans with lower lipschitz complexity. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[30] [30]

On the rate of convergence of K olmogorov-Arnold Network regression estimators,

Wei Liu, Eleni Chatzi, and Zhilu Lai. On the rate of convergence of kolmogorov-arnold network regression estimators.arXiv preprint arXiv:2509.19830, 2025. 13

work page arXiv 2025

[31] [31]

Kan: Kolmogorov-arnold networks

Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljaˇ ci´ c, Thomas Y Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks. InInternational Conference on Learning Representations, 2025

work page 2025

[32] [32]

How to make the gradients small privately: Improved rates for differentially private non-convex optimization

Andrew Lowy, Jonathan Ullman, and Stephen J Wright. How to make the gradients small privately: Improved rates for differentially private non-convex optimization. InInternational Conference on Machine Learning, 2024

work page 2024

[33] [33]

R´ enyi differential privacy

Ilya Mironov. R´ enyi differential privacy. In2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017

work page 2017

[34] [34]

How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

Mike Nguyen and Nicole M¨ ucke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

work page 2024

[35] [35]

Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

work page arXiv 1905

[36] [36]

Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021

work page 2021

[37] [37]

Overparameterized nonlinear learning: Gradient descent takes the shortest path? InInternational Conference on Machine Learning, pages 4951–4960

Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? InInternational Conference on Machine Learning, pages 4951–4960. PMLR, 2019

work page 2019

[38] [38]

Physics informed kolmogorov-arnold neural networks for dynamical analysis via efficient-kan and wav-kan.Journal of Machine Learning Research, 26(233):1–39, 2025

Subhajit Patra, Sonali Panda, Bikram Keshari Parida, Mahima Arya, Kurt Jacobs, Denys I Bondar, and Abhijit Sen. Physics informed kolmogorov-arnold neural networks for dynamical analysis via efficient-kan and wav-kan.Journal of Machine Learning Research, 26(233):1–39, 2025

work page 2025

[39] [39]

Finding local diffusion schrodinger bridge using kolmogorov-arnold network

Xingyu Qiu, Mengying Yang, Xinghua Ma, Fanding Li, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, and Shuo Li. Finding local diffusion schrodinger bridge using kolmogorov-arnold network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23227–23236, 2025

work page 2025

[40] [40]

GI NN-KAN: Interpretability pipelining with applications in physics-informed neural networks,

Nisal Ranasinghe, Yu Xia, Sachith Seneviratne, and Saman Halgamuge. Ginn-kan: Interpretability pipelining with applications in physics informed neural networks.arXiv preprint arXiv:2408.14780, 2024

work page arXiv 2024

[41] [41]

Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel

Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. InAdvances in Neural Information Processing Systems, volume 34. PMLR, 2021

work page 2021

[42] [42]

Convex approximation of two-layer relu networks for hidden state differential privacy

Rob Romijnders and Antti Koskela. Convex approximation of two-layer relu networks for hidden state differential privacy. InAdvances in Neural Information Processing Systems, 2025

work page 2025

[43] [43]

Stability vs implicit bias of gradient methods on separable data and beyond

Matan Schliserman and Tomer Koren. Stability vs implicit bias of gradient methods on separable data and beyond. InConference on Learning Theory, pages 3380–3394. PMLR, 2022

work page 2022

[44] [44]

Towards understanding generalization in dp-gd: A case study in training two-layer cnns

Zhongjie Shi, Puyu Wang, Chenyang Zhang, and Yuan Cao. Towards understanding generalization in dp-gd: A case study in training two-layer cnns. InThe Fortieth AAAI Conference on Artificial Intelligence, 2026

work page 2026

[45] [45]

Khemraj Shukla, Juan Diego Toscano, Zhicheng Wang, Zongren Zou, and George Em Karniadakis. A comprehensive and fair comparison between mlp and kan representations for differential equations and operator networks.Computer Methods in Applied Mechanics and Engineering, 431:117290, 2024

work page 2024

[46] [46]

A survey on kolmogorov-arnold network.ACM Computing Surveys, 58(2):1–35, 2025

Shriyank Somvanshi, Syed Aaqib Javed, Md Monzurul Islam, Diwas Pandit, and Subasish Das. A survey on kolmogorov-arnold network.ACM Computing Surveys, 58(2):1–35, 2025

work page 2025

[47] [47]

Stochastic gradient descent with differentially private updates

Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In2013 IEEE global conference on signal and information processing, pages 245–248. IEEE, 2013. 14

work page 2013

[48] [48]

Kan-ddpm: Kolmogorov-arnold networks with diffusion denoising probabilistic models for mri-to-ct synthesis

Vanessa Su and Xiaofeng Yang. Kan-ddpm: Kolmogorov-arnold networks with diffusion denoising probabilistic models for mri-to-ct synthesis. InMedical Imaging 2025: Imaging Informatics, volume 13411, pages 227–233. SPIE, 2025

work page 2025

[49] [49]

Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

work page 2024

[50] [50]

Sharper guarantees for learning neural network classifiers with gradient methods

Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. InInternational Conference on Learning Representations, 2025

work page 2025

[51] [51]

Kolmogorov- arnold networks (kans) for time series analysis,

Cristian J Vaca-Rubio, Luis Blanco, Roberto Pereira, and M` arius Caus. Kolmogorov-arnold networks (kans) for time series analysis.arXiv preprint arXiv:2405.08790, 2024

work page arXiv 2024

[52] [52]

Cambridge university press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

work page 2018

[53] [53]

Cambridge university press, 2019

Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

work page 2019

[54] [54]

Differentially private empirical risk minimization with non-convex loss functions

Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. InInternational Conference on Machine Learning, pages 6526–6535. PMLR, 2019

work page 2019

[55] [55]

Optimal utility bounds for differentially private gradient descent in three-layer neural networks

Puyu Wang, Yunwen Lei, Marius Kloft, and Yiming Ying. Optimal utility bounds for differentially private gradient descent in three-layer neural networks. In2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–8. IEEE, 2025

work page 2025

[56] [56]

Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

work page 2025

[57] [57]

On the expressiveness and spectral bias of kans

Yixuan Wang, Jonathan W Siegel, Ziming Liu, and Thomas Y Hou. On the expressiveness and spectral bias of kans. InInternational Conference on Learning Representations, 2025

work page 2025

[58] [58]

Kolmogorov–arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on kolmogorov–arnold networks

Yizheng Wang, Jia Sun, Jinshuai Bai, Cosmin Anitescu, Mohammad Sadegh Eshaghi, Xiaoying Zhuang, Timon Rabczuk, and Yinghua Liu. Kolmogorov–arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on kolmogorov–arnold networks. Computer Methods in Applied Mechanics and Engineering, 433:117518, 2025

work page 2025

[59] [59]

A conditional kan diffusion network for human activity recognition with missing sensor signal series

Hao Xiong, Jiayi Gong, Haiyong Luo, Fang Zhao, Yang Gao, Runze Chen, and Mingyu Xiao. A conditional kan diffusion network for human activity recognition with missing sensor signal series. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025

[60] [60]

Efficient private erm for smooth objectives

Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private erm for smooth objectives. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3922–3928, 2017

work page 2017

[61] [61]

Generalization bounds and model complexity for kolmogorov-arnold networks

Xianyang Zhang and Huijuan Zhou. Generalization bounds and model complexity for kolmogorov-arnold networks. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[62] [62]

with probability at least 1−δ

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020. 15 Appendix A Useful Lemmas In this section, we introduce several useful lemmas that will be used in the proofs. Lemma A.1(Concentration of the norm of Sub-Gaussian vectors [53]).Letu ∈R s be a centered ...

work page 2020

[63] [63]

λmin ∇2LS(Θα,t) ≥ − cmax√m LS(Θα,t).(19) wherec max =C σ,b p 3 2 p log( m δ ) + √p+ 3 Θ(0)−Θ ∗ 2

Then, from Proposition B.2 we know the least eigenvalue of∇ 2LS(Θα,t) can be controlled as follows. λmin ∇2LS(Θα,t) ≥ − cmax√m LS(Θα,t).(19) wherec max =C σ,b p 3 2 p log( m δ ) + √p+ 3 Θ(0)−Θ ∗ 2 . Then from Taylor’s theorem and (19), we know that there exists a α∈ [0, 1] and the associated Θ α,t such that with probability at least 1−δ, it holds that LS(...

work page

[64] [64]

Then, the above inequality can be further bounded as follows λmax ∇2LS(Θ(t+ 1)) ≤C σ,b p2 p+ B2 b p m ∥Θ(0)−Θ ∗∥4 2 ≤C σ,bp3 =:ρ ℓ, where the last inequality used the condition m≳∥ Θ(0) − Θ∗∥4

work page

[65] [65]

with probability at least 1 −δ

Then we know ℓ(Θ(t + 1)) is ρℓ-strongly smooth through the trajectory of {Θ(t)}t∈[k]. From the descent lemma (see e.g., [ 49]), we know if η≤ 1/ρℓ, it holds for anyt∈[k] that LS(Θ(t+ 1))≤ L S(Θ(t))− η 2 ∇LS(Θ(t)) 2 2.(21) Combining the above inequality and (20), we know LS(Θ(t+ 1)) ≤ L S(Θ∗)− ∇LS(Θ(t)),Θ ∗ −Θ(t) − η 2 ∇LS(Θ(t)) 2 2 + cmax 2√m max α∈[0,1] ...

work page

[66] [66]

with probability at least 1−δ

Then, the conditions m≳ (log(m/δ) + g2(1/T ))g4(1/T ) and η≲min g2(1), g2(1) LS(Θ(0)) −1 ensure that all the conditions in Theorem 4.3 are satisfied. Applying Theorem 4.3 with Θ ∗ = Θ1/T implies that LS(Θ(T))≲ 2ηTL S(Θ1/T ) + ΛΘ1/T ηT ≤ 2η+g 2 1 T ηT . This completes the proof of the Theorem. 24 The proof of Theorem 4.9 is given as follows. Proof of Theor...

work page

[67] [67]

Then, it holds that ES L(Θ(T))− L S(Θ(T)) ≤C σ,bp2 p+p∥Θ ∗ −Θ(0)∥ 2 2 η n ES h TX t=0 LS Θ(t) i

max ∥Θ(0)−Θ ∗∥2 2,∥Θ(0)−Θ ∗∥4 2 . Then, it holds that ES L(Θ(T))− L S(Θ(T)) ≤C σ,bp2 p+p∥Θ ∗ −Θ(0)∥ 2 2 η n ES h TX t=0 LS Θ(t) i . 26 Proof. For any i∈ [n], let {Θi(k)} and {Θ−i(k)} be produced by GD based Si and S−i, respectively. Observe that the condition ∥Θ(0) − Θ∗∥2 2 ≥ 8 max{ηT LS(Θ∗) + L eS(Θ∗) , η LS(Θ(0)) + L eS(Θ(0)) } implies ∥Θ(0) − Θ∗∥2 2 ≥ ...

work page

[68] [68]

By further noting that (7) holds with probability at least 1 −δ over initialization c(0), this completes the proof of the theorem

Combining this with Theorems B.4 and C.4 yields ES L(Θ(T)) ≲ ηΛ2 Θ∗ n ES h TX t=0 LS Θ(t) i +E S LS(Θ(T)) ≲ Λ2 Θ∗ n + 1 ηT C(Θ∗), where C(Θ∗) = ES CS(Θ∗) . By further noting that (7) holds with probability at least 1 −δ over initialization c(0), this completes the proof of the theorem. Theorem C.5(Generalization under Realizability).Suppose Assumption 4.5...

work page

[69] [69]

Note Lemma A.2 implies ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ) with probability at least 1 −δ/ 2

We begin by estimating the ℓ2-sensitivity of ∂aLS(eΘ(k)) and ∂cLS(eΘ(k)) at each iteration k. Note Lemma A.2 implies ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ) with probability at least 1 −δ/ 2. In the following proof, we assume the event {c(0) : ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ)} holds and aim to show Algorithm 1 satisfies (ϵ, δ/2)-DP. Noting thatSandS ′ differ in the fi...

work page

[70] [70]

Combining (37) and the above equality and dividing byηyield g,Θ ∗ −Θ + ≥ 1 2η ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2

Applying this with (a,b,c) = (Θ,Θ +,Θ ∗) gives Θ−Θ +,Θ ∗ −Θ + = 1 2 ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2 . Combining (37) and the above equality and dividing byηyield g,Θ ∗ −Θ + ≥ 1 2η ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2 . Therefore, g,Θ−Θ ∗ = g,Θ−Θ + − g,Θ ∗ −Θ + ≤ g,Θ−Θ + + 1 2η ∥Θ−Θ ∗∥2 2 − ∥Θ+ −Θ ∗∥2 2 − ∥Θ−Θ +∥2 2 ≤ η 2 ∥g∥2 2 + 1 2η ∥Θ−Θ +∥2 2 + ...

work page

[71] [71]

By further using eΛ2 Θ∗ ≥ηL S(eΘ(0)), it holds 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 4 ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dlog(2T /δ) n2ϵ2

It then follows that TX k=1 EA LS(eΘ(k)) ≤4TL S(Θ∗) + 3 2η ∥eΘ(0)−Θ ∗∥2 2 + 9mpηT 2(1 + log(2T /δ) ϵ ) 2n2ϵ C1d+C 2 + 2LS(eΘ(0)).(42) Therefore, 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 3 2ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dlog(2T /δ) n2ϵ2 + 2 T LS(eΘ(0)). By further using eΛ2 Θ∗ ≥ηL S(eΘ(0)), it holds 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 4 ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dl...

work page

[72] [72]

These match the width conditions stated in the theorem

and m≲ (nϵ)4 p10(log( m δ )+R2)d2(ηT) 4 log2(T /δ). These match the width conditions stated in the theorem. Then, from Lemma D.11 we know the algorithmAis on average argument ∥ eΘ(0)−Θ∗∥2 2 n + mp4dη2T 2 log(T /δ) n3ϵ2 -stable. From Lemma B.1, the condition for m and the fact maxi∈[n] ∥ci(k)∥2 ≤ ∥c(k) − c(0)∥2 ≤R 2 ≤R , we know the Lipschitz constant of t...

work page

[73] [73]

This completes the proof of the theorem

≳R 4. This completes the proof of the theorem. Theorem D.13(Restatement of Theorem 4.15).Let the sequence {eΘ(k)}T k=1 be produced by Algorithm 1 with step size η > 0. LetΘ ∗ be an reference point that obtain small training loss satisfying ∥eΘ(0) − Θ∗∥2 2 ≥ Cmax{ηT (LS(Θ∗)+ L eS(Θ∗)), η(LS(eΘ(0))+ L eS(eΘ(0)))}. If η≤min{ 1/3¯ρ,1}, m≳p 2(log( m δ )+ R2)(R...

work page 2048