Recognition: no theorem link
KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks
Pith reviewed 2026-05-13 07:09 UTC · model grok-4.3
The pith
KAN-CL shows that per-knot regularization in Kolmogorov-Arnold network heads reduces forgetting by 88 percent and 93 percent on two Split-CIFAR continual learning benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KAN-CL performs importance-weighted anchoring at per-knot granularity by exploiting the compact-support spline parameterization of Kolmogorov-Arnold Networks. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone, KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. The paper further provides a Neural Tangent Kernel analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime.
What carries the argument
Per-knot importance regularization using the compact-support splines of KANs, which anchors parameters locally per knot rather than globally per neuron.
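As a concrete illustration, the sketch below shows what per-knot anchoring can look like in PyTorch: a Fisher-style squared-gradient importance is accumulated for each spline coefficient of the head (one coefficient per knot/basis function), and a quadratic penalty anchors each coefficient to its value after the previous task. The paper's exact importance estimator, KAN head implementation, and the parameter-name filter used here are assumptions for illustration, not the authors' code.

```python
import torch

# Minimal sketch of per-knot importance anchoring on a KAN head.
# Assumption: the head exposes its spline coefficients as parameters whose
# names contain "spline"; the importance estimator is a squared-gradient
# (Fisher-style) accumulation, which may differ from the paper's estimator.

def accumulate_knot_importance(head, loader, loss_fn, device="cpu"):
    """Accumulate squared gradients for each spline coefficient (one weight per knot/basis)."""
    importance = {n: torch.zeros_like(p) for n, p in head.named_parameters()
                  if "spline" in n}
    for x, y in loader:
        head.zero_grad()
        loss_fn(head(x.to(device)), y.to(device)).backward()
        for n, p in head.named_parameters():
            if n in importance and p.grad is not None:
                importance[n] += p.grad.detach() ** 2  # per-knot importance estimate
    return importance

def per_knot_penalty(head, anchors, importance, lam=1.0):
    """Quadratic anchoring of each spline coefficient to its post-task value,
    weighted by its own importance (per knot, not per neuron)."""
    penalty = 0.0
    for n, p in head.named_parameters():
        if n in importance:
            penalty = penalty + (importance[n] * (p - anchors[n]) ** 2).sum()
    return lam * penalty
```

In use, `anchors` would be a snapshot of the spline coefficients taken after training on the previous task, and the penalty is added to the current task's loss alongside the backbone's EWC term.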
If this is right
- Standard backbone regularizers like EWC can be combined with KAN heads for additive forgetting control.
- The NTK rank deficit supplies a forgetting bound without requiring frozen features.
- Performance on current tasks remains competitive with or better than existing methods.
- The gains hold across different numbers of tasks in the Split-CIFAR suite.
Where Pith is reading between the lines
- Replacing MLP heads with KAN heads in other continual learning pipelines could yield similar locality benefits.
- Exploring KANs in non-vision continual learning tasks would test whether the spline advantage is domain-specific.
- Quantifying the exact rank deficit could allow prediction of forgetting rates for new task sequences.
- Hybrid models that use KAN layers deeper in the network might extend the rank-deficit effect beyond the head.
Load-bearing premise
The local support of KAN splines induces a rank deficit in the cross-task neural tangent kernel that bounds forgetting even when features are learned.
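A minimal single-edge sketch of the mechanism, assuming the head output depends linearly on the spline coefficients $c_j$ of one KAN edge (this is an intuition aid, not the paper's full statement or bound):

```latex
% Head output along one KAN edge: f(x) = \sum_j c_j B_j(u(x)), where u(x) is the
% backbone feature feeding the edge and B_j is a compactly supported B-spline basis.
\Theta_{\mathrm{head}}(x, x')
  \;=\; \sum_j \frac{\partial f(x)}{\partial c_j}\,\frac{\partial f(x')}{\partial c_j}
  \;=\; \sum_j B_j\!\big(u(x)\big)\, B_j\!\big(u(x')\big).
```

Each term vanishes unless $u(x)$ and $u(x')$ fall within overlapping knot supports, so samples from tasks that occupy different regions of the spline domain contribute little to the cross-task block of the kernel; an MLP head has no such cancellation because its weights are active for every input.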
What would settle it
Directly computing the rank of the cross-task NTK for both KAN and MLP heads trained on the same sequence of tasks; absence of a larger deficit for KAN would undermine the theoretical account.
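A minimal sketch of that check, assuming `head` is any `torch.nn.Module` (KAN or MLP classification head) applied to fixed backbone features, and using a scalar readout per sample for simplicity; this is the proposed diagnostic, not the paper's code:

```python
import torch

def flat_grad(head, x):
    """Gradient of a scalar readout of head(x) w.r.t. all head parameters, flattened."""
    out = head(x.unsqueeze(0)).sum()  # scalar readout per sample (simplification)
    params = [p for p in head.parameters()]
    grads = torch.autograd.grad(out, params, allow_unused=True)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                      for g, p in zip(grads, params)])

def cross_task_ntk(head, feats_task_a, feats_task_b):
    """Empirical cross-task NTK block: Theta[i, j] = <grad f(x_a_i), grad f(x_b_j)>."""
    Ja = torch.stack([flat_grad(head, x) for x in feats_task_a])  # (n_a, n_params)
    Jb = torch.stack([flat_grad(head, x) for x in feats_task_b])  # (n_b, n_params)
    return Ja @ Jb.T

def effective_rank(theta, tol=1e-6):
    """Number of singular values above a relative tolerance."""
    s = torch.linalg.svdvals(theta)
    return int((s > tol * s.max()).sum())

# Usage: compare effective_rank(cross_task_ntk(kan_head, feats_t1, feats_t2))
# against the same quantity for an MLP head on identical backbone features.
```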
Figures
- Figure 5: Empirical cross-task NTK coupling at initialization. Normalized operator norm for the KAN head vs. the MLP head across five datasets; KAN is consistently below MLP, with the largest gap on CIFAR-100 (the highest-dimensional feature space).
read the original abstract
Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform per-knot importance-weighted anchoring. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), it reports forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T, while matching or exceeding baseline accuracies. It further provides an NTK analysis claiming that KAN spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime.
Significance. If the empirical results prove robust under full verification and the NTK bound is rigorously established, the work would be significant for continual learning. It demonstrates a compositional approach combining architecture-induced parameter locality (KAN head) with complementary regularization (bbEWC), achieving large forgetting reductions on standard benchmarks. The attempt to derive a bound beyond the lazy-training limit is a notable strength, though it currently rests on a high-level sketch.
major comments (2)
- [NTK analysis] NTK analysis section: The central theoretical claim asserts that spline locality creates a structural rank deficit in the cross-task NTK, producing a forgetting bound that holds in the feature-learning regime. Standard NTK theory linearizes around fixed features in the infinite-width lazy limit; the manuscript provides only a sketched derivation and does not explicitly handle dynamic kernel evolution under gradient-driven feature adaptation in finite-width networks. This is load-bearing for the theoretical contribution.
- [Empirical evaluation] Empirical evaluation (abstract and §4): The abstract states concrete forgetting reductions (88% on Split-CIFAR-10/5T, 93% on Split-CIFAR-100/10T) and accuracy parity, but the manuscript lacks full details on data splits, number of independent runs, statistical significance tests, or complete hyperparameter tables. Without these, the central empirical claim cannot be verified.
minor comments (2)
- [Abstract] Abstract: The term 'bbEWC' is used without prior definition; expand the acronym or define it on first use for clarity.
- [Methods] Methods section: Notation for per-knot importance weights and regularization terms should be introduced consistently and cross-referenced to the NTK derivation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications and committing to specific revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
- Referee: [NTK analysis] NTK analysis section: The central theoretical claim asserts that spline locality creates a structural rank deficit in the cross-task NTK, producing a forgetting bound that holds in the feature-learning regime. Standard NTK theory linearizes around fixed features in the infinite-width lazy limit; the manuscript provides only a sketched derivation and does not explicitly handle dynamic kernel evolution under gradient-driven feature adaptation in finite-width networks. This is load-bearing for the theoretical contribution.
Authors: We appreciate the referee's emphasis on rigor for the NTK analysis. The manuscript's sketch demonstrates that KAN spline compact support induces localized kernel entries, creating a structural rank deficit in the cross-task NTK that limits interference even under feature adaptation. To address the concern about dynamic evolution in finite-width networks, the revised version will expand the section and add an appendix with a more detailed derivation. This will incorporate recent analyses of NTK dynamics beyond the lazy regime (e.g., via gradient flow on finite-width models) to show that the locality-induced rank deficiency persists, thereby making the forgetting bound hold in the feature-learning setting. revision: yes
- Referee: [Empirical evaluation] Empirical evaluation (abstract and §4): The abstract states concrete forgetting reductions (88% on Split-CIFAR-10/5T, 93% on Split-CIFAR-100/10T) and accuracy parity, but the manuscript lacks full details on data splits, number of independent runs, statistical significance tests, or complete hyperparameter tables. Without these, the central empirical claim cannot be verified.
Authors: We agree that full reproducibility details are essential. The revised manuscript will expand §4 (and the appendix) with: (i) exact definitions and preprocessing steps for the Split-CIFAR-10/5T and Split-CIFAR-100/10T splits; (ii) results aggregated over five independent runs with different random seeds, including mean and standard deviation; (iii) statistical significance tests (paired t-tests with reported p-values) comparing KAN-CL against baselines; and (iv) complete hyperparameter tables for all methods, including learning rates, regularization strengths, and KAN-specific spline parameters. These additions will allow independent verification of the reported forgetting reductions and accuracy parity. revision: yes
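A minimal sketch of the committed significance protocol, with hypothetical placeholder accuracies standing in for the five-seed results (these numbers are illustrative only, not from the paper):

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed final average accuracies over 5 seeds (placeholder values).
kan_cl   = np.array([71.2, 70.8, 71.5, 70.9, 71.1])
baseline = np.array([68.4, 68.9, 68.1, 68.6, 68.3])

print(f"KAN-CL  : {kan_cl.mean():.2f} ± {kan_cl.std(ddof=1):.2f}")
print(f"baseline: {baseline.mean():.2f} ± {baseline.std(ddof=1):.2f}")

# Paired t-test across seeds, as committed in the rebuttal.
t, p = stats.ttest_rel(kan_cl, baseline)
print(f"paired t-test: t={t:.2f}, p={p:.4f}")
```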
Circularity Check
No significant circularity; NTK bound derived from spline locality as independent analysis
full rationale
The paper's central theoretical claim is an NTK analysis that starts from KAN spline locality to induce a structural rank deficit in the cross-task NTK and thereby obtain a forgetting bound. This is presented as a derivation rather than a fit to target data or a self-citation chain. Empirical results (88% and 93% forgetting reductions on Split-CIFAR-10/5T and Split-CIFAR-100/10T) are reported as separate experimental outcomes using bbEWC on the backbone plus per-knot regularization on the KAN head; they do not reduce by construction to the NTK equations. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The derivation is therefore self-contained, and the empirical claims are checked against external benchmarks rather than against the theory itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: the Neural Tangent Kernel approximation holds for the combined backbone-plus-KAN architecture
- standard math: B-splines have strictly local support (see the sketch below)
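The locality axiom can be checked directly. The sketch below evaluates a cubic B-spline basis function with the Cox-de Boor recursion and confirms it is zero outside its knot span; it is illustrative code, not tied to the paper's implementation.

```python
import numpy as np

def bspline_basis(i, k, t, x):
    """Cox-de Boor recursion: i-th B-spline basis of degree k on knot vector t."""
    if k == 0:
        return np.where((t[i] <= x) & (x < t[i + 1]), 1.0, 0.0)
    left = 0.0
    if t[i + k] > t[i]:
        left = (x - t[i]) / (t[i + k] - t[i]) * bspline_basis(i, k - 1, t, x)
    right = 0.0
    if t[i + k + 1] > t[i + 1]:
        right = (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * bspline_basis(i + 1, k - 1, t, x)
    return left + right

t = np.linspace(-1, 1, 9)        # uniform knot grid on [-1, 1]
x = np.linspace(-1, 1, 201)
B3 = bspline_basis(3, 3, t, x)   # cubic basis supported only on [t[3], t[7]]
assert np.allclose(B3[(x < t[3]) | (x > t[7])], 0.0)  # zero outside its knot span
```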
Reference graph
Works this paper leans on
- [1]
- [2] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks. In Psychology of Learning and Motivation, vol. 24, 1989.
- [3] R. M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 1999.
- [4] J. Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 2017.
- [5] J. Schwarz et al. Progress & compress: A scalable framework for continual learning. ICML, 2018.
- [6]
- [7] R. Aljundi et al. Memory aware synapses: Learning what (not) to forget. ECCV, 2018.
- [8] S.-A. Rebuffi et al. iCaRL: Incremental classifier and representation learning. CVPR, 2017.
- [9] A. Chaudhry et al. On tiny episodic memories in continual learning. arXiv:1902.10486, 2019.
- [10] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. NeurIPS, 2017.
- [11] P. Buzzega et al. Dark experience for general continual learning. NeurIPS, 2020.
- [12]
- [13] M. Farajtabar et al. Orthogonal gradient descent for continual learning. arXiv:1910.07104, 2020.
- [14] A. A. Rusu et al. Progressive neural networks. arXiv:1606.04671, 2016.
- [15] A. Mallya and S. Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In CVPR, 2018.
- [16] G. M. van de Ven, T. Tuytelaars, and A. S. Tolias. Three types of incremental learning. Nature Machine Intelligence, 4, 2022.
- [17] Z. Liu et al. KAN: Kolmogorov–Arnold Networks. arXiv:2404.19756, 2024.
- [18] Z. Liu et al. KAN 2.0: Kolmogorov–Arnold Networks meet science. arXiv:2408.10205, 2024.
- [19] S. Somvanshi et al. A survey on Kolmogorov–Arnold Networks. arXiv:2411.06078, 2024.
- [20]
- [21] R. Genet and M. Inzirillo. TKAN: Temporal Kolmogorov–Arnold Networks. arXiv:2405.07344, 2024.
- [22] M. Kiamari, M. Kiamari, and B. Krishnamachari. GKAN: Graph Kolmogorov–Arnold Networks. arXiv:2406.06470, 2024.
- [23]
- [24] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS, 2018.
- [25]
- [26] T. Doan et al. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. AISTATS, 2021.
- [27] M. Bennani, T. Doan, and M. Sugiyama. Generalisation guarantees for continual learning with orthogonal gradient descent. arXiv:2006.11942, 2020.
- [28]
- [29]
- [30] M. Cheon and C. Mun. Combining KAN with CNN: KonvNeXt's performance in remote sensing and patent insights. Remote Sensing, 16(18):3417, 2024.
- [31]
- [32]
- [33]
- [34] Z. Wang et al. EWC++: Improved Bayesian continual learning with optimized plasticity. arXiv:2302.03535, 2023.