Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks
Pith reviewed 2026-05-16 09:31 UTC · model grok-4.3
The pith
Gradient descent on two-layer Kolmogorov-Arnold networks reaches optimization and generalization rates of order 1/T and 1/n with polylogarithmic width, and differential privacy makes that width necessary.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For two-layer KANs trained by gradient descent under the NTK-separable assumption on logistic loss, polylogarithmic network width suffices to achieve an optimization rate of order 1/T after T iterations and a generalization rate of order 1/n for n samples. In the differentially private setting the noise required for (epsilon, delta)-DP yields a utility bound of order sqrt(d)/(n epsilon), matching the lower bound for general convex Lipschitz problems and implying that polylogarithmic width is necessary as well as sufficient under differential privacy.
What carries the argument
Gradient descent dynamics on two-layer KANs specialized under the NTK-separable assumption for logistic loss, which controls the optimization and generalization rates and establishes necessity of polylogarithmic width under differential privacy.
If this is right
- Polylogarithmic width suffices for GD to converge at rate 1/T.
- Generalization error scales as 1/n with sample size.
- Private utility matches the classical sqrt(d) lower bound for convex Lipschitz problems.
- Polylogarithmic width becomes necessary once differential privacy is required.
- The derived rates can guide practical choices of network width and early stopping.
Where Pith is reading between the lines
- The private versus non-private width gap may appear in other network families when similar assumptions hold.
- Practitioners could set minimal widths for private training directly from the polylogarithmic threshold.
- The tightness to classical lower bounds suggests the analysis captures the essential cost of privacy for this architecture.
- Width selection informed by these bounds may improve trade-offs between privacy and accuracy in deployed models.
Load-bearing premise
The training data must satisfy an NTK-separable assumption under logistic loss so the general bounds specialize to the fast rates and the privacy necessity result.
What would settle it
Observing that a network with sub-polylogarithmic width achieves the claimed 1/T optimization rate on NTK-separable logistic-loss data, or that private training with such narrow networks exceeds the sqrt(d)/(n epsilon) utility bound, would contradict the sufficiency and necessity claims.
Figures
read the original abstract
Kolmogorov--Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, with $T$ denoting the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\epsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\epsilon)$ (with $d$ the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between non-private (sufficiency only) and private (necessity also emerges) training regimes. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes gradient descent training of two-layer Kolmogorov-Arnold Networks, deriving general bounds on optimization dynamics, generalization, and differential privacy utility. Specializing to logistic loss under an NTK-separable assumption, it claims that polylogarithmic width suffices to achieve O(1/T) optimization and O(1/n) generalization rates. In the private setting it derives a utility bound of order √d/(nε) that matches the classical convex Lipschitz lower bound and argues that polylogarithmic width is also necessary under DP, revealing a qualitative gap between the private and non-private regimes. Experiments are used to illustrate practical guidance on width selection and early stopping.
Significance. If the results hold, the work supplies the first systematic theoretical treatment of optimization, generalization, and privacy for KANs under gradient descent. The matching of the classical DP lower bound and the claimed distinction in width requirements between private and non-private training would be useful for guiding architecture design in this emerging model family. The dependence on the NTK-separable assumption, however, restricts the immediate scope of the rates and the necessity claim.
major comments (3)
- [Abstract] Abstract: the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing the general analysis to logistic loss under the NTK-separable assumption; the manuscript does not show that these rates follow from the general bounds without this assumption, making the assumption load-bearing for the central sufficiency claims.
- [Abstract] Abstract: the necessity of polylogarithmic width under (ε,δ)-DP is asserted, yet the argument appears to route through the same NTK-separable specialization used for sufficiency; without an independent argument or a demonstration that necessity holds for general convex Lipschitz problems, the claimed qualitative gap between private and non-private regimes is not fully supported.
- [Abstract] Abstract: the utility bound of order √d/(nε) is stated to match the classical lower bound, but the derivation of the noise scale required for (ε,δ)-DP and the precise dependence on network width are not exhibited in the provided text, preventing verification that the bound remains tight once the NTK-separable assumption is imposed.
minor comments (1)
- The abstract refers to 'two-layer KANs' without specifying the precise form of the univariate functions or the spline basis used; a brief clarification would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed report. The comments help us clarify the precise scope of our results and the role of the NTK-separable assumption. We address each major comment below and will revise the abstract and introduction for greater precision.
read point-by-point responses
-
Referee: [Abstract] Abstract: the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing the general analysis to logistic loss under the NTK-separable assumption; the manuscript does not show that these rates follow from the general bounds without this assumption, making the assumption load-bearing for the central sufficiency claims.
Authors: We agree that the O(1/T) optimization and O(1/n) generalization rates are obtained only after specializing to logistic loss under the NTK-separable assumption. The general bounds derived earlier in the paper apply to two-layer KANs without this assumption but yield weaker rates that depend on additional problem parameters. The specialization is explicitly stated in the abstract and is load-bearing for the tight rates; we will revise the abstract to emphasize this distinction more clearly and add a sentence noting that the general bounds serve as the foundation from which the specialized rates are derived. revision: yes
-
Referee: [Abstract] Abstract: the necessity of polylogarithmic width under (ε,δ)-DP is asserted, yet the argument appears to route through the same NTK-separable specialization used for sufficiency; without an independent argument or a demonstration that necessity holds for general convex Lipschitz problems, the claimed qualitative gap between private and non-private regimes is not fully supported.
Authors: The necessity claim for polylogarithmic width is established within the same NTK-separable logistic-loss setting used for the sufficiency results. We do not claim or prove necessity for arbitrary convex Lipschitz problems; our focus is on the KAN model class. The qualitative gap we highlight is therefore internal to this setting: polylogarithmic width suffices for the non-private rates, while the same width becomes necessary to achieve the stated utility under DP. We will add a clarifying sentence in the abstract and introduction to restrict the necessity statement to the specialized regime and briefly discuss why the DP analysis forces the width lower bound in this case. revision: partial
-
Referee: [Abstract] Abstract: the utility bound of order √d/(nε) is stated to match the classical lower bound, but the derivation of the noise scale required for (ε,δ)-DP and the precise dependence on network width are not exhibited in the provided text, preventing verification that the bound remains tight once the NTK-separable assumption is imposed.
Authors: The noise scale for (ε,δ)-DP is derived in Section 4 by applying the Gaussian mechanism to the per-iteration gradients, with sensitivity bounded using the NTK feature map under the separability assumption. The resulting utility bound of order √d/(nε) is obtained after substituting the polylogarithmic width that suffices for optimization; the bound is independent of width once this threshold is met and matches the classical convex Lipschitz lower bound. The full derivation, including the width dependence in the sensitivity, appears in the main text. We will add a parenthetical reference to Section 4 in the abstract to facilitate verification. revision: yes
Circularity Check
No significant circularity; derivation proceeds from explicit NTK-separable assumption and external classical bounds
full rationale
The paper first states general bounds on GD dynamics, generalization, and DP utility for two-layer KANs. It then specializes these bounds to logistic loss under the explicitly declared NTK-separable assumption to obtain the 1/T optimization and 1/n generalization rates with polylog width. The private utility bound of order √d/(nε) is obtained by adding Gaussian noise calibrated to the sensitivity and is shown to match the known classical lower bound for convex Lipschitz problems (an external result, not derived inside the paper). The necessity claim for polylog width under DP likewise rests on this matching to the external convex lower bound rather than on any self-referential construction or fitted parameter. No equation reduces to a prior equation by definition, no parameter is fitted on a data subset and then relabeled a prediction, and no load-bearing step relies on a self-citation whose content is itself unverified. The analysis is therefore self-contained once the assumption is granted, yielding a circularity score of 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption NTK-separable assumption on the data distribution for logistic loss
Reference graph
Works this paper leans on
-
[1]
Deep learning with differential privacy
Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016
work page 2016
-
[2]
A convergence theory for deep learning via over- parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019
work page 2019
-
[3]
On functions of three variables
Vladimir Igorevich Arnol’d. On functions of three variables. InDoklady Akademii Nauk, volume 114, pages 679–681. Russian Academy of Sciences, 1957
work page 1957
-
[4]
Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019
work page 2019
-
[5]
Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in Neural Information Processing Systems, 30, 2017
work page 2017
-
[6]
Private stochastic convex optimization with optimal rates
Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. InAdvances in neural information processing systems, volume 32, 2019
work page 2019
-
[7]
Differentially private stochastic optimization: New results in convex and non-convex settings
Raef Bassily, Crist´ obal Guzm´ an, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. InAdvances in Neural Information Processing Systems, volume 34, pages 9317–9329, 2021
work page 2021
-
[8]
Convolutional kolmogorov-arnold networks.arXiv preprint arXiv:2406.13155, 2024
Alexander Dylan Bodner, Antonio Santiago Tepsich, Jack Natan Spolski, and Santiago Pourteau. Convolutional kolmogorov-arnold networks.arXiv preprint arXiv:2406.13155, 2024
-
[9]
On a constructive proof of kolmogorov’s superposition theorem
J¨ urgen Braun and Michael Griebel. On a constructive proof of kolmogorov’s superposition theorem. Constructive approximation, 30(3):653–675, 2009
work page 2009
-
[10]
Zhiqi Bu, Hua Wang, Zongyu Dai, and Qi Long. On the convergence and calibration of deep learning with differential privacy.Transactions on machine learning research, 2023:https–openreview, 2023
work page 2023
-
[11]
Generalization bounds of stochastic gradient descent for wide and deep neural networks
Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019
work page 2019
-
[12]
Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021. 12
work page 2021
- [13]
-
[14]
Li Deng. The MNIST database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012
work page 2012
-
[15]
Gradient descent finds global minima of deep neural networks
Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019
work page 2019
-
[16]
Calibrating noise to sensitivity in private data analysis
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. InTheory of cryptography conference, pages 265–284. Springer, 2006
work page 2006
-
[17]
Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy.Foundations and Trends®in Theoretical Computer Science, 9(3–4):211–407, 2014
work page 2014
-
[18]
Johannes Erdmann, Florian Mausolf, and Jan Lukas Sp¨ ah. Kan we improve on hep classification tasks? kolmogorov–arnold networks applied to an lhc physics example.Computing and Software for Big Science, 9(1):9, 2025
work page 2025
-
[19]
Kanice: Kolmogorov-arnold networks with interactive convolutional elements
Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, David Dobson, Kendall N Niles, Ken Pathak, and Steven Sloan. Kanice: Kolmogorov-arnold networks with interactive convolutional elements. In Proceedings of the 4th International Conference on AI-ML Systems, pages 1–10, 2024
work page 2024
-
[20]
Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023
work page 2023
-
[21]
Yihang Gao and Vincent YF Tan. On the convergence of (stochastic) gradient descent for kolmogorov– arnold networks.IEEE Transactions on Information Theory, 2025
work page 2025
-
[22]
A temporal Kolmogorov-Arnold tran sformer for time series forecasting,
Remi Genet and Hugo Inzirillo. A temporal kolmogorov-arnold transformer for time series forecasting. arXiv preprint arXiv:2406.02486, 2024
-
[23]
Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[24]
Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. InInternational Conference on Learning Representations, 2020
work page 2020
-
[25]
Andre˘ ı Nikolaevich Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.Translations American Mathematical Society, 2(28):55–59, 1963
work page 1963
-
[26]
Stability and generalization analysis of gradient methods for shallow neural networks
Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022
work page 2022
-
[27]
Fine-grained analysis of stability and generalization for stochastic gradient descent
Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. InInternational Conference on Machine Learning, pages 5809–5819, 2020
work page 2020
-
[28]
Longlong Li, Yipeng Zhang, Guanghui Wang, and Kelin Xia. Kolmogorov–arnold graph neural networks for molecular property prediction.Nature Machine Intelligence, 7:1346–1354, 08 2025
work page 2025
-
[29]
Pengqi Li, Lizhong Ding, Jiarun Fu, Guoren Wang, Ye Yuan, et al. Generalization bounds for kolmogorov- arnold networks (kans) and enhanced kans with lower lipschitz complexity. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[30]
On the rate of convergence of K olmogorov-Arnold Network regression estimators,
Wei Liu, Eleni Chatzi, and Zhilu Lai. On the rate of convergence of kolmogorov-arnold network regression estimators.arXiv preprint arXiv:2509.19830, 2025. 13
-
[31]
Kan: Kolmogorov-arnold networks
Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljaˇ ci´ c, Thomas Y Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks. InInternational Conference on Learning Representations, 2025
work page 2025
-
[32]
Andrew Lowy, Jonathan Ullman, and Stephen J Wright. How to make the gradients small privately: Improved rates for differentially private non-convex optimization. InInternational Conference on Machine Learning, 2024
work page 2024
-
[33]
Ilya Mironov. R´ enyi differential privacy. In2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017
work page 2017
-
[34]
Mike Nguyen and Nicole M¨ ucke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024
work page 2024
-
[35]
Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019
-
[36]
Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime
Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021
work page 2021
-
[37]
Samet Oymak and Mahdi Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? InInternational Conference on Machine Learning, pages 4951–4960. PMLR, 2019
work page 2019
-
[38]
Subhajit Patra, Sonali Panda, Bikram Keshari Parida, Mahima Arya, Kurt Jacobs, Denys I Bondar, and Abhijit Sen. Physics informed kolmogorov-arnold neural networks for dynamical analysis via efficient-kan and wav-kan.Journal of Machine Learning Research, 26(233):1–39, 2025
work page 2025
-
[39]
Finding local diffusion schrodinger bridge using kolmogorov-arnold network
Xingyu Qiu, Mengying Yang, Xinghua Ma, Fanding Li, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, and Shuo Li. Finding local diffusion schrodinger bridge using kolmogorov-arnold network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23227–23236, 2025
work page 2025
-
[40]
GI NN-KAN: Interpretability pipelining with applications in physics-informed neural networks,
Nisal Ranasinghe, Yu Xia, Sachith Seneviratne, and Saman Halgamuge. Ginn-kan: Interpretability pipelining with applications in physics informed neural networks.arXiv preprint arXiv:2408.14780, 2024
-
[41]
Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. InAdvances in Neural Information Processing Systems, volume 34. PMLR, 2021
work page 2021
-
[42]
Convex approximation of two-layer relu networks for hidden state differential privacy
Rob Romijnders and Antti Koskela. Convex approximation of two-layer relu networks for hidden state differential privacy. InAdvances in Neural Information Processing Systems, 2025
work page 2025
-
[43]
Stability vs implicit bias of gradient methods on separable data and beyond
Matan Schliserman and Tomer Koren. Stability vs implicit bias of gradient methods on separable data and beyond. InConference on Learning Theory, pages 3380–3394. PMLR, 2022
work page 2022
-
[44]
Towards understanding generalization in dp-gd: A case study in training two-layer cnns
Zhongjie Shi, Puyu Wang, Chenyang Zhang, and Yuan Cao. Towards understanding generalization in dp-gd: A case study in training two-layer cnns. InThe Fortieth AAAI Conference on Artificial Intelligence, 2026
work page 2026
-
[45]
Khemraj Shukla, Juan Diego Toscano, Zhicheng Wang, Zongren Zou, and George Em Karniadakis. A comprehensive and fair comparison between mlp and kan representations for differential equations and operator networks.Computer Methods in Applied Mechanics and Engineering, 431:117290, 2024
work page 2024
-
[46]
A survey on kolmogorov-arnold network.ACM Computing Surveys, 58(2):1–35, 2025
Shriyank Somvanshi, Syed Aaqib Javed, Md Monzurul Islam, Diwas Pandit, and Subasish Das. A survey on kolmogorov-arnold network.ACM Computing Surveys, 58(2):1–35, 2025
work page 2025
-
[47]
Stochastic gradient descent with differentially private updates
Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In2013 IEEE global conference on signal and information processing, pages 245–248. IEEE, 2013. 14
work page 2013
-
[48]
Vanessa Su and Xiaofeng Yang. Kan-ddpm: Kolmogorov-arnold networks with diffusion denoising probabilistic models for mri-to-ct synthesis. InMedical Imaging 2025: Imaging Informatics, volume 13411, pages 227–233. SPIE, 2025
work page 2025
-
[49]
Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024
work page 2024
-
[50]
Sharper guarantees for learning neural network classifiers with gradient methods
Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. InInternational Conference on Learning Representations, 2025
work page 2025
-
[51]
Kolmogorov- arnold networks (kans) for time series analysis,
Cristian J Vaca-Rubio, Luis Blanco, Roberto Pereira, and M` arius Caus. Kolmogorov-arnold networks (kans) for time series analysis.arXiv preprint arXiv:2405.08790, 2024
-
[52]
Cambridge university press, 2018
Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018
work page 2018
-
[53]
Cambridge university press, 2019
Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019
work page 2019
-
[54]
Differentially private empirical risk minimization with non-convex loss functions
Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. InInternational Conference on Machine Learning, pages 6526–6535. PMLR, 2019
work page 2019
-
[55]
Optimal utility bounds for differentially private gradient descent in three-layer neural networks
Puyu Wang, Yunwen Lei, Marius Kloft, and Yiming Ying. Optimal utility bounds for differentially private gradient descent in three-layer neural networks. In2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–8. IEEE, 2025
work page 2025
-
[56]
Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025
work page 2025
-
[57]
On the expressiveness and spectral bias of kans
Yixuan Wang, Jonathan W Siegel, Ziming Liu, and Thomas Y Hou. On the expressiveness and spectral bias of kans. InInternational Conference on Learning Representations, 2025
work page 2025
-
[58]
Yizheng Wang, Jia Sun, Jinshuai Bai, Cosmin Anitescu, Mohammad Sadegh Eshaghi, Xiaoying Zhuang, Timon Rabczuk, and Yinghua Liu. Kolmogorov–arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on kolmogorov–arnold networks. Computer Methods in Applied Mechanics and Engineering, 433:117518, 2025
work page 2025
-
[59]
A conditional kan diffusion network for human activity recognition with missing sensor signal series
Hao Xiong, Jiayi Gong, Haiyong Luo, Fang Zhao, Yang Gao, Runze Chen, and Mingyu Xiao. A conditional kan diffusion network for human activity recognition with missing sensor signal series. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025
work page 2025
-
[60]
Efficient private erm for smooth objectives
Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private erm for smooth objectives. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3922–3928, 2017
work page 2017
-
[61]
Generalization bounds and model complexity for kolmogorov-arnold networks
Xianyang Zhang and Huijuan Zhou. Generalization bounds and model complexity for kolmogorov-arnold networks. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[62]
Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020. 15 Appendix A Useful Lemmas In this section, we introduce several useful lemmas that will be used in the proofs. Lemma A.1(Concentration of the norm of Sub-Gaussian vectors [53]).Letu ∈R s be a centered ...
work page 2020
-
[63]
λmin ∇2LS(Θα,t) ≥ − cmax√m LS(Θα,t).(19) wherec max =C σ,b p 3 2 p log( m δ ) + √p+ 3 Θ(0)−Θ ∗ 2
Then, from Proposition B.2 we know the least eigenvalue of∇ 2LS(Θα,t) can be controlled as follows. λmin ∇2LS(Θα,t) ≥ − cmax√m LS(Θα,t).(19) wherec max =C σ,b p 3 2 p log( m δ ) + √p+ 3 Θ(0)−Θ ∗ 2 . Then from Taylor’s theorem and (19), we know that there exists a α∈ [0, 1] and the associated Θ α,t such that with probability at least 1−δ, it holds that LS(...
-
[64]
Then, the above inequality can be further bounded as follows λmax ∇2LS(Θ(t+ 1)) ≤C σ,b p2 p+ B2 b p m ∥Θ(0)−Θ ∗∥4 2 ≤C σ,bp3 =:ρ ℓ, where the last inequality used the condition m≳∥ Θ(0) − Θ∗∥4
-
[65]
with probability at least 1 −δ
Then we know ℓ(Θ(t + 1)) is ρℓ-strongly smooth through the trajectory of {Θ(t)}t∈[k]. From the descent lemma (see e.g., [ 49]), we know if η≤ 1/ρℓ, it holds for anyt∈[k] that LS(Θ(t+ 1))≤ L S(Θ(t))− η 2 ∇LS(Θ(t)) 2 2.(21) Combining the above inequality and (20), we know LS(Θ(t+ 1)) ≤ L S(Θ∗)− ∇LS(Θ(t)),Θ ∗ −Θ(t) − η 2 ∇LS(Θ(t)) 2 2 + cmax 2√m max α∈[0,1] ...
-
[66]
Then, the conditions m≳ (log(m/δ) + g2(1/T ))g4(1/T ) and η≲min g2(1), g2(1) LS(Θ(0)) −1 ensure that all the conditions in Theorem 4.3 are satisfied. Applying Theorem 4.3 with Θ ∗ = Θ1/T implies that LS(Θ(T))≲ 2ηTL S(Θ1/T ) + ΛΘ1/T ηT ≤ 2η+g 2 1 T ηT . This completes the proof of the Theorem. 24 The proof of Theorem 4.9 is given as follows. Proof of Theor...
-
[67]
Then, it holds that ES L(Θ(T))− L S(Θ(T)) ≤C σ,bp2 p+p∥Θ ∗ −Θ(0)∥ 2 2 η n ES h TX t=0 LS Θ(t) i
max ∥Θ(0)−Θ ∗∥2 2,∥Θ(0)−Θ ∗∥4 2 . Then, it holds that ES L(Θ(T))− L S(Θ(T)) ≤C σ,bp2 p+p∥Θ ∗ −Θ(0)∥ 2 2 η n ES h TX t=0 LS Θ(t) i . 26 Proof. For any i∈ [n], let {Θi(k)} and {Θ−i(k)} be produced by GD based Si and S−i, respectively. Observe that the condition ∥Θ(0) − Θ∗∥2 2 ≥ 8 max{ηT LS(Θ∗) + L eS(Θ∗) , η LS(Θ(0)) + L eS(Θ(0)) } implies ∥Θ(0) − Θ∗∥2 2 ≥ ...
-
[68]
Combining this with Theorems B.4 and C.4 yields ES L(Θ(T)) ≲ ηΛ2 Θ∗ n ES h TX t=0 LS Θ(t) i +E S LS(Θ(T)) ≲ Λ2 Θ∗ n + 1 ηT C(Θ∗), where C(Θ∗) = ES CS(Θ∗) . By further noting that (7) holds with probability at least 1 −δ over initialization c(0), this completes the proof of the theorem. Theorem C.5(Generalization under Realizability).Suppose Assumption 4.5...
-
[69]
Note Lemma A.2 implies ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ) with probability at least 1 −δ/ 2
We begin by estimating the ℓ2-sensitivity of ∂aLS(eΘ(k)) and ∂cLS(eΘ(k)) at each iteration k. Note Lemma A.2 implies ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ) with probability at least 1 −δ/ 2. In the following proof, we assume the event {c(0) : ∥c(0)∥2 ≤ 4√pm + 2 p log(2/δ)} holds and aim to show Algorithm 1 satisfies (ϵ, δ/2)-DP. Noting thatSandS ′ differ in the fi...
-
[70]
Applying this with (a,b,c) = (Θ,Θ +,Θ ∗) gives Θ−Θ +,Θ ∗ −Θ + = 1 2 ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2 . Combining (37) and the above equality and dividing byηyield g,Θ ∗ −Θ + ≥ 1 2η ∥Θ+ −Θ ∗∥2 2 +∥Θ−Θ +∥2 2 − ∥Θ−Θ ∗∥2 2 . Therefore, g,Θ−Θ ∗ = g,Θ−Θ + − g,Θ ∗ −Θ + ≤ g,Θ−Θ + + 1 2η ∥Θ−Θ ∗∥2 2 − ∥Θ+ −Θ ∗∥2 2 − ∥Θ−Θ +∥2 2 ≤ η 2 ∥g∥2 2 + 1 2η ∥Θ−Θ +∥2 2 + ...
-
[71]
It then follows that TX k=1 EA LS(eΘ(k)) ≤4TL S(Θ∗) + 3 2η ∥eΘ(0)−Θ ∗∥2 2 + 9mpηT 2(1 + log(2T /δ) ϵ ) 2n2ϵ C1d+C 2 + 2LS(eΘ(0)).(42) Therefore, 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 3 2ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dlog(2T /δ) n2ϵ2 + 2 T LS(eΘ(0)). By further using eΛ2 Θ∗ ≥ηL S(eΘ(0)), it holds 1 T TX k=1 EA LS(eΘ(k)) ≤4L S(Θ∗) + 4 ηT ∥eΘ(0)−Θ ∗∥2 2 + mp4ηT dl...
-
[72]
These match the width conditions stated in the theorem
and m≲ (nϵ)4 p10(log( m δ )+R2)d2(ηT) 4 log2(T /δ). These match the width conditions stated in the theorem. Then, from Lemma D.11 we know the algorithmAis on average argument ∥ eΘ(0)−Θ∗∥2 2 n + mp4dη2T 2 log(T /δ) n3ϵ2 -stable. From Lemma B.1, the condition for m and the fact maxi∈[n] ∥ci(k)∥2 ≤ ∥c(k) − c(0)∥2 ≤R 2 ≤R , we know the Lipschitz constant of t...
-
[73]
This completes the proof of the theorem
≳R 4. This completes the proof of the theorem. Theorem D.13(Restatement of Theorem 4.15).Let the sequence {eΘ(k)}T k=1 be produced by Algorithm 1 with step size η > 0. LetΘ ∗ be an reference point that obtain small training loss satisfying ∥eΘ(0) − Θ∗∥2 2 ≥ Cmax{ηT (LS(Θ∗)+ L eS(Θ∗)), η(LS(eΘ(0))+ L eS(eΘ(0)))}. If η≤min{ 1/3¯ρ,1}, m≳p 2(log( m δ )+ R2)(R...
work page 2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.