pith. sign in

arxiv: 2601.22409 · v3 · pith:WMNEPMVCnew · submitted 2026-01-29 · 💻 cs.LG · cs.AI· stat.ML

Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks

Pith reviewed 2026-05-16 09:31 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML
keywords Kolmogorov-Arnold Networksgradient descentdifferential privacygeneralization boundsoptimization ratesnetwork widthlogistic losstwo-layer networks
0
0 comments X

The pith

Gradient descent on two-layer Kolmogorov-Arnold networks reaches optimization and generalization rates of order 1/T and 1/n with polylogarithmic width, and differential privacy makes that width necessary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes gradient descent for two-layer Kolmogorov-Arnold networks and derives bounds that characterize training dynamics, generalization, and utility under differential privacy. Under an NTK-separable assumption with logistic loss, polylogarithmic network width suffices to obtain an optimization rate of order 1/T and a generalization rate of order 1/n. In the private setting the analysis produces a utility bound of order sqrt(d) over n epsilon that matches the classical lower bound for convex Lipschitz problems. This reveals a gap: polylogarithmic width is sufficient without privacy but both sufficient and necessary once differential privacy is imposed.

Core claim

For two-layer KANs trained by gradient descent under the NTK-separable assumption on logistic loss, polylogarithmic network width suffices to achieve an optimization rate of order 1/T after T iterations and a generalization rate of order 1/n for n samples. In the differentially private setting the noise required for (epsilon, delta)-DP yields a utility bound of order sqrt(d)/(n epsilon), matching the lower bound for general convex Lipschitz problems and implying that polylogarithmic width is necessary as well as sufficient under differential privacy.

What carries the argument

Gradient descent dynamics on two-layer KANs specialized under the NTK-separable assumption for logistic loss, which controls the optimization and generalization rates and establishes necessity of polylogarithmic width under differential privacy.

If this is right

  • Polylogarithmic width suffices for GD to converge at rate 1/T.
  • Generalization error scales as 1/n with sample size.
  • Private utility matches the classical sqrt(d) lower bound for convex Lipschitz problems.
  • Polylogarithmic width becomes necessary once differential privacy is required.
  • The derived rates can guide practical choices of network width and early stopping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The private versus non-private width gap may appear in other network families when similar assumptions hold.
  • Practitioners could set minimal widths for private training directly from the polylogarithmic threshold.
  • The tightness to classical lower bounds suggests the analysis captures the essential cost of privacy for this architecture.
  • Width selection informed by these bounds may improve trade-offs between privacy and accuracy in deployed models.

Load-bearing premise

The training data must satisfy an NTK-separable assumption under logistic loss so the general bounds specialize to the fast rates and the privacy necessity result.

What would settle it

Observing that a network with sub-polylogarithmic width achieves the claimed 1/T optimization rate on NTK-separable logistic-loss data, or that private training with such narrow networks exceeds the sqrt(d)/(n epsilon) utility bound, would contradict the sufficiency and necessity claims.

Figures

Figures reproduced from arXiv: 2601.22409 by Junyu Zhou, Marius Kloft, Philipp Liznerski, Puyu Wang.

Figure 1
Figure 1. Figure 1: Comparison of MLP and KAN performance on genomic sequence classification. Each point pair [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) Training and test accuracy as a function of the width m (with T fixed) and the number of iterations T (with m fixed) for GD, and (b) private utility as a function of m and T for DP-GD, on synthetic logistic data (top row) and MNIST (bottom row). In (a), the vertical dashed lines indicate empirically observed change points beyond which increasing the width m yields diminishing or flat training and test … view at source ↗
Figure 3
Figure 3. Figure 3: Training and test losses versus m and T. E.3 Experiments The paper shows results for two types of experiments. Loss over m. Here, we vary the model width m and report the resulting training and test metrics of the model for each 50 random seeds on the synthetic data and 20 random seeds on MNIST. Loss over T. Here, we vary the full training dataset iterations T and report the resulting training and test met… view at source ↗
read the original abstract

Kolmogorov--Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, with $T$ denoting the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\epsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\epsilon)$ (with $d$ the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between non-private (sufficiency only) and private (necessity also emerges) training regimes. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper analyzes gradient descent training of two-layer Kolmogorov-Arnold Networks, deriving general bounds on optimization dynamics, generalization, and differential privacy utility. Specializing to logistic loss under an NTK-separable assumption, it claims that polylogarithmic width suffices to achieve O(1/T) optimization and O(1/n) generalization rates. In the private setting it derives a utility bound of order √d/(nε) that matches the classical convex Lipschitz lower bound and argues that polylogarithmic width is also necessary under DP, revealing a qualitative gap between the private and non-private regimes. Experiments are used to illustrate practical guidance on width selection and early stopping.

Significance. If the results hold, the work supplies the first systematic theoretical treatment of optimization, generalization, and privacy for KANs under gradient descent. The matching of the classical DP lower bound and the claimed distinction in width requirements between private and non-private training would be useful for guiding architecture design in this emerging model family. The dependence on the NTK-separable assumption, however, restricts the immediate scope of the rates and the necessity claim.

major comments (3)
  1. [Abstract] Abstract: the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing the general analysis to logistic loss under the NTK-separable assumption; the manuscript does not show that these rates follow from the general bounds without this assumption, making the assumption load-bearing for the central sufficiency claims.
  2. [Abstract] Abstract: the necessity of polylogarithmic width under (ε,δ)-DP is asserted, yet the argument appears to route through the same NTK-separable specialization used for sufficiency; without an independent argument or a demonstration that necessity holds for general convex Lipschitz problems, the claimed qualitative gap between private and non-private regimes is not fully supported.
  3. [Abstract] Abstract: the utility bound of order √d/(nε) is stated to match the classical lower bound, but the derivation of the noise scale required for (ε,δ)-DP and the precise dependence on network width are not exhibited in the provided text, preventing verification that the bound remains tight once the NTK-separable assumption is imposed.
minor comments (1)
  1. The abstract refers to 'two-layer KANs' without specifying the precise form of the univariate functions or the spline basis used; a brief clarification would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments help us clarify the precise scope of our results and the role of the NTK-separable assumption. We address each major comment below and will revise the abstract and introduction for greater precision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing the general analysis to logistic loss under the NTK-separable assumption; the manuscript does not show that these rates follow from the general bounds without this assumption, making the assumption load-bearing for the central sufficiency claims.

    Authors: We agree that the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing to logistic loss under the NTK-separable assumption. The general bounds derived earlier in the paper apply to two-layer KANs without this assumption but yield weaker rates that depend on additional problem parameters. The specialization is explicitly stated in the abstract and is load-bearing for the tight rates; we will revise the abstract to emphasize this distinction more clearly and add a sentence noting that the general bounds serve as the foundation from which the specialized rates are derived. revision: yes

  2. Referee: [Abstract] Abstract: the necessity of polylogarithmic width under (ε,δ)-DP is asserted, yet the argument appears to route through the same NTK-separable specialization used for sufficiency; without an independent argument or a demonstration that necessity holds for general convex Lipschitz problems, the claimed qualitative gap between private and non-private regimes is not fully supported.

    Authors: The necessity claim for polylogarithmic width is established within the same NTK-separable logistic-loss setting used for the sufficiency results. We do not claim or prove necessity for arbitrary convex Lipschitz problems; our focus is on the KAN model class. The qualitative gap we highlight is therefore internal to this setting: polylogarithmic width suffices for the non-private rates, while the same width becomes necessary to achieve the stated utility under DP. We will add a clarifying sentence in the abstract and introduction to restrict the necessity statement to the specialized regime and briefly discuss why the DP analysis forces the width lower bound in this case. revision: partial

  3. Referee: [Abstract] Abstract: the utility bound of order √d/(nε) is stated to match the classical lower bound, but the derivation of the noise scale required for (ε,δ)-DP and the precise dependence on network width are not exhibited in the provided text, preventing verification that the bound remains tight once the NTK-separable assumption is imposed.

    Authors: The noise scale for (ε,δ)-DP is derived in Section 4 by applying the Gaussian mechanism to the per-iteration gradients, with sensitivity bounded using the NTK feature map under the separability assumption. The resulting utility bound of order √d/(nε) is obtained after substituting the polylogarithmic width that suffices for optimization; the bound is independent of width once this threshold is met and matches the classical convex Lipschitz lower bound. The full derivation, including the width dependence in the sensitivity, appears in the main text. We will add a parenthetical reference to Section 4 in the abstract to facilitate verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from explicit NTK-separable assumption and external classical bounds

full rationale

The paper first states general bounds on GD dynamics, generalization, and DP utility for two-layer KANs. It then specializes these bounds to logistic loss under the explicitly declared NTK-separable assumption to obtain the 1/T optimization and 1/n generalization rates with polylog width. The private utility bound of order √d/(nε) is obtained by adding Gaussian noise calibrated to the sensitivity and is shown to match the known classical lower bound for convex Lipschitz problems (an external result, not derived inside the paper). The necessity claim for polylog width under DP likewise rests on this matching to the external convex lower bound rather than on any self-referential construction or fitted parameter. No equation reduces to a prior equation by definition, no parameter is fitted on a data subset and then relabeled a prediction, and no load-bearing step relies on a self-citation whose content is itself unverified. The analysis is therefore self-contained once the assumption is granted, yielding a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the NTK-separable assumption as a key domain assumption. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption NTK-separable assumption on the data distribution for logistic loss
    This assumption allows the specialization of the general bounds to achieve the stated optimization and generalization rates with polylog width.

pith-pipeline@v0.9.0 · 5549 in / 1372 out tokens · 37851 ms · 2026-05-16T09:31:00.048163+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages

  1. [1]

    Deep learning with differential privacy

    Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016

  2. [2]

    A convergence theory for deep learning via over- parameterization

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

  3. [3]

    On functions of three variables

    Vladimir Igorevich Arnol’d. On functions of three variables. InDoklady Akademii Nauk, volume 114, pages 679–681. Russian Academy of Sciences, 1957

  4. [4]

    Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

    Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

  5. [5]

    Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017

  6. [6]

    Private stochastic convex optimization with optimal rates

    Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. InAdvances in neural information processing systems, volume 32, 2019

  7. [7]

    Differentially private stochastic optimization: New results in convex and non-convex settings

    Raef Bassily, Crist´ obal Guzm´ an, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. InAdvances in Neural Information Processing Systems, volume 34, pages 9317–9329, 2021

  8. [8]

    Convolutional kolmogorov- arnold networks

    Alexander Dylan Bodner, Antonio Santiago Tepsich, Jack Natan Spolski, and Santiago Pourteau. Convolutional kolmogorov-arnold networks.arXiv preprint arXiv:2406.13155, 2024

  9. [9]

    On a constructive proof of kolmogorov’s superposition theorem

    J¨ urgen Braun and Michael Griebel. On a constructive proof of kolmogorov’s superposition theorem. Constructive approximation, 30(3):653–675, 2009

  10. [10]

    On the convergence and calibration of deep learning with differential privacy.Transactions on machine learning research, 2023:https–openreview, 2023

    Zhiqi Bu, Hua Wang, Zongyu Dai, and Qi Long. On the convergence and calibration of deep learning with differential privacy.Transactions on machine learning research, 2023:https–openreview, 2023

  11. [11]

    Generalization bounds of stochastic gradient descent for wide and deep neural networks

    Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

  12. [12]

    How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

    Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021. 12

  13. [13]

    Poptsova

    Oleksandr Cherednichenko and Maria S. Poptsova. Kolmogorov-arnold networks for genomic tasks. Briefings Bioinform., 26(2), 2025

  14. [14]

    The MNIST database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

    Li Deng. The MNIST database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

  15. [15]

    Gradient descent finds global minima of deep neural networks

    Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

  16. [16]

    Calibrating noise to sensitivity in private data analysis

    Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. InTheory of cryptography conference, pages 265–284. Springer, 2006

  17. [17]

    The algorithmic foundations of differential privacy.Foundations and Trends®in Theoretical Computer Science, 9(3–4):211–407, 2014

    Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy.Foundations and Trends®in Theoretical Computer Science, 9(3–4):211–407, 2014

  18. [18]

    Kan we improve on hep classification tasks? kolmogorov–arnold networks applied to an lhc physics example.Computing and Software for Big Science, 9(1):9, 2025

    Johannes Erdmann, Florian Mausolf, and Jan Lukas Sp¨ ah. Kan we improve on hep classification tasks? kolmogorov–arnold networks applied to an lhc physics example.Computing and Software for Big Science, 9(1):9, 2025

  19. [19]

    Kanice: Kolmogorov-arnold networks with interactive convolutional elements

    Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, David Dobson, Kendall N Niles, Ken Pathak, and Steven Sloan. Kanice: Kolmogorov-arnold networks with interactive convolutional elements. In Proceedings of the 4th International Conference on AI-ML Systems, pages 1–10, 2024

  20. [20]

    Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

    Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

  21. [21]

    On the convergence of (stochastic) gradient descent for kolmogorov– arnold networks.IEEE Transactions on Information Theory, 2025

    Yihang Gao and Vincent YF Tan. On the convergence of (stochastic) gradient descent for kolmogorov– arnold networks.IEEE Transactions on Information Theory, 2025

  22. [22]

    Genet and H

    Remi Genet and Hugo Inzirillo. A temporal kolmogorov-arnold transformer for time series forecasting. arXiv preprint arXiv:2406.02486, 2024

  23. [23]

    Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems, 31, 2018

  24. [24]

    Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks

    Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

  25. [25]

    Andre˘ ı Nikolaevich Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.Translations American Mathematical Society, 2(28):55–59, 1963

  26. [26]

    Stability and generalization analysis of gradient methods for shallow neural networks

    Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

  27. [27]

    Fine-grained analysis of stability and generalization for stochastic gradient descent

    Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. InInternational Conference on Machine Learning, pages 5809–5819, 2020

  28. [28]

    Kolmogorov–arnold graph neural networks for molecular property prediction.Nature Machine Intelligence, 7:1346–1354, 08 2025

    Longlong Li, Yipeng Zhang, Guanghui Wang, and Kelin Xia. Kolmogorov–arnold graph neural networks for molecular property prediction.Nature Machine Intelligence, 7:1346–1354, 08 2025

  29. [29]

    Generalization bounds for kolmogorov- arnold networks (kans) and enhanced kans with lower lipschitz complexity

    Pengqi Li, Lizhong Ding, Jiarun Fu, Guoren Wang, Ye Yuan, et al. Generalization bounds for kolmogorov- arnold networks (kans) and enhanced kans with lower lipschitz complexity. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  30. [30]

    On the rate of convergence of K olmogorov-Arnold Network regression estimators,

    Wei Liu, Eleni Chatzi, and Zhilu Lai. On the rate of convergence of kolmogorov-arnold network regression estimators.arXiv preprint arXiv:2509.19830, 2025. 13

  31. [31]

    Kan: Kolmogorov-arnold networks

    Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljaˇ ci´ c, Thomas Y Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks. InInternational Conference on Learning Representations, 2025

  32. [32]

    How to make the gradients small privately: Improved rates for differentially private non-convex optimization

    Andrew Lowy, Jonathan Ullman, and Stephen J Wright. How to make the gradients small privately: Improved rates for differentially private non-convex optimization. InInternational Conference on Machine Learning, 2024

  33. [33]

    R´ enyi differential privacy

    Ilya Mironov. R´ enyi differential privacy. In2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017

  34. [34]

    How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

    Mike Nguyen and Nicole M¨ ucke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

  35. [35]

    Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

    Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

  36. [36]

    Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

    Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021

  37. [37]

    Overparameterized nonlinear learning: Gradient descent takes the shortest path? InInternational Conference on Machine Learning, pages 4951–4960

    Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? InInternational Conference on Machine Learning, pages 4951–4960. PMLR, 2019

  38. [38]

    Physics informed kolmogorov-arnold neural networks for dynamical analysis via efficient-kan and wav-kan.Journal of Machine Learning Research, 26(233):1–39, 2025

    Subhajit Patra, Sonali Panda, Bikram Keshari Parida, Mahima Arya, Kurt Jacobs, Denys I Bondar, and Abhijit Sen. Physics informed kolmogorov-arnold neural networks for dynamical analysis via efficient-kan and wav-kan.Journal of Machine Learning Research, 26(233):1–39, 2025

  39. [39]

    Finding local diffusion schrodinger bridge using kolmogorov-arnold network

    Xingyu Qiu, Mengying Yang, Xinghua Ma, Fanding Li, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, and Shuo Li. Finding local diffusion schrodinger bridge using kolmogorov-arnold network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23227–23236, 2025

  40. [40]

    GI NN-KAN: Interpretability pipelining with applications in physics-informed neural networks,

    Nisal Ranasinghe, Yu Xia, Sachith Seneviratne, and Saman Halgamuge. Ginn-kan: Interpretability pipelining with applications in physics informed neural networks.arXiv preprint arXiv:2408.14780, 2024

  41. [41]

    Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel

    Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. InAdvances in Neural Information Processing Systems, volume 34. PMLR, 2021

  42. [42]

    Convex approximation of two-layer relu networks for hidden state differential privacy

    Rob Romijnders and Antti Koskela. Convex approximation of two-layer relu networks for hidden state differential privacy. InAdvances in Neural Information Processing Systems, 2025

  43. [43]

    Stability vs implicit bias of gradient methods on separable data and beyond

    Matan Schliserman and Tomer Koren. Stability vs implicit bias of gradient methods on separable data and beyond. InConference on Learning Theory, pages 3380–3394. PMLR, 2022

  44. [44]

    Towards understanding generalization in dp-gd: A case study in training two-layer cnns

    Zhongjie Shi, Puyu Wang, Chenyang Zhang, and Yuan Cao. Towards understanding generalization in dp-gd: A case study in training two-layer cnns. InThe Fortieth AAAI Conference on Artificial Intelligence, 2026

  45. [45]

    Khemraj Shukla, Juan Diego Toscano, Zhicheng Wang, Zongren Zou, and George Em Karniadakis. A comprehensive and fair comparison between mlp and kan representations for differential equations and operator networks.Computer Methods in Applied Mechanics and Engineering, 431:117290, 2024

  46. [46]

    A survey on kolmogorov-arnold network.ACM Computing Surveys, 58(2):1–35, 2025

    Shriyank Somvanshi, Syed Aaqib Javed, Md Monzurul Islam, Diwas Pandit, and Subasish Das. A survey on kolmogorov-arnold network.ACM Computing Surveys, 58(2):1–35, 2025

  47. [47]

    Stochastic gradient descent with differentially private updates

    Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In2013 IEEE global conference on signal and information processing, pages 245–248. IEEE, 2013. 14

  48. [48]

    Kan-ddpm: Kolmogorov-arnold networks with diffusion denoising probabilistic models for mri-to-ct synthesis

    Vanessa Su and Xiaofeng Yang. Kan-ddpm: Kolmogorov-arnold networks with diffusion denoising probabilistic models for mri-to-ct synthesis. InMedical Imaging 2025: Imaging Informatics, volume 13411, pages 227–233. SPIE, 2025

  49. [49]

    Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

    Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

  50. [50]

    Sharper guarantees for learning neural network classifiers with gradient methods

    Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. InInternational Conference on Learning Representations, 2025

  51. [51]

    Kolmogorov-arnold networks (kans) for time series analysis

    Cristian J Vaca-Rubio, Luis Blanco, Roberto Pereira, and M` arius Caus. Kolmogorov-arnold networks (kans) for time series analysis.arXiv preprint arXiv:2405.08790, 2024

  52. [52]

    Cambridge university press, 2018

    Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

  53. [53]

    Cambridge university press, 2019

    Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

  54. [54]

    Differentially private empirical risk minimization with non-convex loss functions

    Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. InInternational Conference on Machine Learning, pages 6526–6535. PMLR, 2019

  55. [55]

    Optimal utility bounds for differentially private gradient descent in three-layer neural networks

    Puyu Wang, Yunwen Lei, Marius Kloft, and Yiming Ying. Optimal utility bounds for differentially private gradient descent in three-layer neural networks. In2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–8. IEEE, 2025

  56. [56]

    Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

    Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

  57. [57]

    On the expressiveness and spectral bias of kans

    Yixuan Wang, Jonathan W Siegel, Ziming Liu, and Thomas Y Hou. On the expressiveness and spectral bias of kans. InInternational Conference on Learning Representations, 2025

  58. [58]

    Kolmogorov–arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on kolmogorov–arnold networks

    Yizheng Wang, Jia Sun, Jinshuai Bai, Cosmin Anitescu, Mohammad Sadegh Eshaghi, Xiaoying Zhuang, Timon Rabczuk, and Yinghua Liu. Kolmogorov–arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on kolmogorov–arnold networks. Computer Methods in Applied Mechanics and Engineering, 433:117518, 2025

  59. [59]

    A conditional kan diffusion network for human activity recognition with missing sensor signal series

    Hao Xiong, Jiayi Gong, Haiyong Luo, Fang Zhao, Yang Gao, Runze Chen, and Mingyu Xiao. A conditional kan diffusion network for human activity recognition with missing sensor signal series. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  60. [60]

    Efficient private erm for smooth objectives

    Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private erm for smooth objectives. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3922–3928, 2017

  61. [61]

    Generalization bounds and model complexity for kolmogorov-arnold networks

    Xianyang Zhang and Huijuan Zhou. Generalization bounds and model complexity for kolmogorov-arnold networks. InThe Thirteenth International Conference on Learning Representations, 2025

  62. [62]

    with probability at least 1−δ

    Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020. 15 Appendix A Useful Lemmas In this section, we introduce several useful lemmas that will be used in the proofs. Lemma A.1(Concentration of the norm of Sub-Gaussian vectors [53]).Letu ∈R s be a centered ...

  63. [63]

    λmin ∇2LS(Θα,t) ≥ − cmax√m LS(Θα,t).(19) wherec max =C σ,b p 3 2 p log( m δ ) + √p+ 3 Θ(0)−Θ ∗ 2

    Then, from Proposition B.2 we know the least eigenvalue of∇ 2LS(Θα,t) can be controlled as follows. λmin ∇2LS(Θα,t) ≥ − cmax√m LS(Θα,t).(19) wherec max =C σ,b p 3 2 p log( m δ ) + √p+ 3 Θ(0)−Θ ∗ 2 . Then from Taylor’s theorem and (19), we know that there exists a α∈ [0, 1] and the associated Θ α,t such that with probability at least 1−δ, it holds that LS(...

  64. [64]

    Then, the above inequality can be further bounded as follows λmax ∇2LS(Θ(t+ 1)) ≤C σ,b p2 p+ B2 b p m ∥Θ(0)−Θ ∗∥4 2 ≤C σ,bp3 =:ρ ℓ, where the last inequality used the condition m≳∥ Θ(0) − Θ∗∥4

  65. [65]

    with probability at least 1 −δ

    Then we know ℓ(Θ(t + 1)) is ρℓ-strongly smooth through the trajectory of {Θ(t)}t∈[k]. From the descent lemma (see e.g., [ 49]), we know if η≤ 1/ρℓ, it holds for anyt∈[k] that LS(Θ(t+ 1))≤ L S(Θ(t))− η 2 ∇LS(Θ(t)) 2 2.(21) Combining the above inequality and (20), we know LS(Θ(t+ 1)) ≤ L S(Θ∗)− ∇LS(Θ(t)),Θ ∗ −Θ(t) − η 2 ∇LS(Θ(t)) 2 2 + cmax 2√m max α∈[0,1] ...

  66. [66]

    with probability at least 1−δ

    Then, the conditions m≳ (log(m/δ) + g2(1/T ))g4(1/T ) and η≲min g2(1), g2(1) LS(Θ(0)) −1 ensure that all the conditions in Theorem 4.3 are satisfied. Applying Theorem 4.3 with Θ ∗ = Θ1/T implies that LS(Θ(T))≲ 2ηTL S(Θ1/T ) + ΛΘ1/T ηT ≤ 2η+g 2 1 T ηT . This completes the proof of the Theorem. 24 The proof of Theorem 4.9 is given as follows. Proof of Theor...

  67. [67]

    Then, it holds that ES L(Θ(T))− L S(Θ(T)) ≤C σ,bp2 p+p∥Θ ∗ −Θ(0)∥ 2 2 η n ES h TX t=0 LS Θ(t) i

    max ∥Θ(0)−Θ ∗∥2 2,∥Θ(0)−Θ ∗∥4 2 . Then, it holds that ES L(Θ(T))− L S(Θ(T)) ≤C σ,bp2 p+p∥Θ ∗ −Θ(0)∥ 2 2 η n ES h TX t=0 LS Θ(t) i . 26 Proof. For any i∈ [n], let {Θi(k)} and {Θ−i(k)} be produced by GD based Si and S−i, respectively. Observe that the condition ∥Θ(0) − Θ∗∥2 2 ≥ 8 max{ηT LS(Θ∗) + L eS(Θ∗) , η LS(Θ(0)) + L eS(Θ(0)) } implies ∥Θ(0) − Θ∗∥2 2 ≥ ...

  68. [68]

    By further noting that (7) holds with probability at least 1 −δ over initialization c(0), this completes the proof of the theorem

    Combining this with Theorems B.4 and C.4 yields ES L(Θ(T)) ≲ ηΛ2 Θ∗ n ES h TX t=0 LS Θ(t) i +E S LS(Θ(T)) ≲ Λ2 Θ∗ n + 1 ηT C(Θ∗), where C(Θ∗) = ES CS(Θ∗) . By further noting that (7) holds with probability at least 1 −δ over initialization c(0), this completes the proof of the theorem. Theorem C.5(Generalization under Realizability).Suppose Assumption 4.5...

  69. [69]

    Note Lemma A.2 implies ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ) with probability at least 1 −δ/ 2

    We begin by estimating the ℓ2-sensitivity of ∂aLS(eΘ(k)) and ∂cLS(eΘ(k)) at each iteration k. Note Lemma A.2 implies ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ) with probability at least 1 −δ/ 2. In the following proof, we assume the event {c(0) : ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ)} holds and aim to show Algorithm 1 satisfies (ϵ, δ/2)-DP. Noting thatSandS ′ differ in the fi...

  70. [70]

    Combining (37) and the above equality and dividing byηyield g,Θ ∗ −Θ + ≥ 1 2η ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2

    Applying this with (a,b,c) = (Θ,Θ +,Θ ∗) gives Θ−Θ +,Θ ∗ −Θ + = 1 2 ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2 . Combining (37) and the above equality and dividing byηyield g,Θ ∗ −Θ + ≥ 1 2η ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2 . Therefore, g,Θ−Θ ∗ = g,Θ−Θ + − g,Θ ∗ −Θ + ≤ g,Θ−Θ + + 1 2η ∥Θ−Θ ∗∥2 2 − ∥Θ+ −Θ ∗∥2 2 − ∥Θ−Θ +∥2 2 ≤ η 2 ∥g∥2 2 + 1 2η ∥Θ−Θ +∥2 2 + ...

  71. [71]

    By further using eΛ2 Θ∗ ≥ηL S(eΘ(0)), it holds 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 4 ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dlog(2T /δ) n2ϵ2

    It then follows that TX k=1 EA LS(eΘ(k)) ≤4TL S(Θ∗) + 3 2η ∥eΘ(0)−Θ ∗∥2 2 + 9mpηT 2(1 + log(2T /δ) ϵ ) 2n2ϵ C1d+C 2 + 2LS(eΘ(0)).(42) Therefore, 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 3 2ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dlog(2T /δ) n2ϵ2 + 2 T LS(eΘ(0)). By further using eΛ2 Θ∗ ≥ηL S(eΘ(0)), it holds 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 4 ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dl...

  72. [72]

    These match the width conditions stated in the theorem

    and m≲ (nϵ)4 p10(log( m δ )+R2)d2(ηT) 4 log2(T /δ). These match the width conditions stated in the theorem. Then, from Lemma D.11 we know the algorithmAis on average argument ∥ eΘ(0)−Θ∗∥2 2 n + mp4dη2T 2 log(T /δ) n3ϵ2 -stable. From Lemma B.1, the condition for m and the fact maxi∈[n] ∥ci(k)∥2 ≤ ∥c(k) − c(0)∥2 ≤R 2 ≤R , we know the Lipschitz constant of t...

  73. [73]

    This completes the proof of the theorem

    ≳R 4. This completes the proof of the theorem. Theorem D.13(Restatement of Theorem 4.15).Let the sequence {eΘ(k)}T k=1 be produced by Algorithm 1 with step size η > 0. LetΘ ∗ be an reference point that obtain small training loss satisfying ∥eΘ(0) − Θ∗∥2 2 ≥ Cmax{ηT (LS(Θ∗)+ L eS(Θ∗)), η(LS(eΘ(0))+ L eS(eΘ(0)))}. If η≤min{ 1/3¯ρ,1}, m≳p 2(log( m δ )+ R2)(R...