pith. machine review for the scientific record.

arxiv: 2605.12306 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.CV

Recognition: no theorem link

KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords continual learning · catastrophic forgetting · Kolmogorov-Arnold networks · spline locality · neural tangent kernel · regularization · Split-CIFAR

The pith

KAN-CL shows that per-knot regularization in Kolmogorov-Arnold network heads reduces forgetting by 88% and 93% on the Split-CIFAR-10/5T and Split-CIFAR-100/10T continual learning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Kolmogorov-Arnold Networks can serve as classification heads that allow regularization to be applied at the granularity of individual spline knots. This per-knot importance weighting, when used together with standard regularization on the convolutional backbone, produces substantial reductions in catastrophic forgetting on standard benchmarks. The key supporting argument is that the local support of the splines creates a rank deficiency in the neural tangent kernel between different tasks. This deficiency limits interference even if the underlying features are still being adapted during training. If the claim holds, continual learning systems gain a new structural lever for protecting old knowledge without sacrificing new task performance.

Core claim

KAN-CL performs importance-weighted anchoring at per-knot granularity by exploiting the compact-support spline parameterization of Kolmogorov-Arnold Networks. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone, KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. The paper further provides a Neural Tangent Kernel analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime.

What carries the argument

Per-knot importance regularization using the compact-support splines of KANs, which anchors parameters locally per knot rather than globally per neuron.
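
To make the mechanism concrete, here is a minimal PyTorch sketch of a per-knot anchoring penalty of the kind described above. The parameter naming, the diagonal importance estimate, and the way it is combined with a backbone EWC term are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def per_knot_anchor_loss(kan_head, anchors, importances, lam=1.0):
    """Quadratic anchoring applied per spline coefficient (per knot).

    anchors[name]:     coefficient values saved after the previous task
    importances[name]: per-coefficient importance weights, same shape as the
                       parameter (e.g. a diagonal Fisher estimate accumulated
                       only over inputs whose spline support covers that knot)
    Names and the importance estimator are assumptions for illustration.
    """
    loss = torch.zeros((), device=next(kan_head.parameters()).device)
    for name, param in kan_head.named_parameters():
        if name in anchors:
            loss = loss + (importances[name] * (param - anchors[name]) ** 2).sum()
    return lam * loss

# Sketch of the combined objective on task t: new-task loss, the per-knot
# anchor on the KAN head, and standard EWC on the convolutional backbone
# (what the paper abbreviates bbEWC).
# total_loss = task_loss + per_knot_anchor_loss(head, anchors, imps, lam) + bb_ewc_loss
```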

If this is right

  • Standard backbone regularizers like EWC can be combined with KAN heads for additive forgetting control.
  • The NTK rank deficit supplies a forgetting bound without requiring frozen features.
  • Performance on current tasks remains competitive with or better than existing methods.
  • The gains hold across different numbers of tasks in the Split-CIFAR suite.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Replacing MLP heads with KAN heads in other continual learning pipelines could yield similar locality benefits.
  • Exploring KANs in non-vision continual learning tasks would test whether the spline advantage is domain-specific.
  • Quantifying the exact rank deficit could allow prediction of forgetting rates for new task sequences.
  • Hybrid models that use KAN layers deeper in the network might extend the rank-deficit effect beyond the head.

Load-bearing premise

The local support of KAN splines induces a rank deficit in the cross-task neural tangent kernel that bounds forgetting even when features are learned.

What would settle it

Directly computing the rank of the cross-task NTK for both KAN and MLP heads trained on the same sequence of tasks; absence of a larger deficit for KAN would undermine the theoretical account.
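
Such a check is easy to sketch. The snippet below computes an empirical cross-task NTK block for a head (KAN or MLP) over feature batches from two tasks and reports its effective rank, using the standard torch.func Jacobian-contraction recipe; the scalar-output head, function names, and rank tolerance are simplifying assumptions for illustration, not the paper's protocol.

```python
import torch
from torch.func import functional_call, vmap, jacrev

def cross_task_ntk(head, feats_a, feats_b):
    """Empirical cross-task NTK block K_ab[i, j] = <grad_theta f(x_i^A), grad_theta f(x_j^B)>.

    `head` is any torch.nn.Module mapping a feature vector to a single scalar
    output (a simplifying assumption); feats_a / feats_b are feature batches
    drawn from two different tasks.
    """
    params = dict(head.named_parameters())

    def f(p, x):
        return functional_call(head, p, (x.unsqueeze(0),)).squeeze()

    def jac_rows(feats):
        # Per-example Jacobians w.r.t. all head parameters, flattened to rows.
        jacs = vmap(jacrev(f), (None, 0))(params, feats)
        return torch.cat([j.reshape(feats.shape[0], -1) for j in jacs.values()], dim=1)

    return jac_rows(feats_a) @ jac_rows(feats_b).T

def effective_rank(kernel, tol=1e-6):
    # Count singular values above a relative tolerance.
    s = torch.linalg.svdvals(kernel)
    return int((s > tol * s.max()).sum())
```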

Figures

Figures reproduced from arXiv: 2605.12306 by Minjong Cheon.

Figure 1. Main results on Split-CIFAR benchmarks (Task-IL, 3 seeds). Left: average accuracy (↑); right: forgetting (↓). KAN-CL+bbEWC (dark teal) achieves the lowest forgetting on all three benchmarks while matching or exceeding the accuracy of all baselines. Error bars denote standard deviation.

Figure 2. Forgetting reduction of KAN-CL+bbEWC relative to KAN-CL (head only). Arrows connect the head-only baseline (purple) to the full method (dark teal); EWC (global) is shown as a reference (blue). KAN-CL+bbEWC reduces forgetting by 88% and 93% on CIFAR-10/5T and CIFAR-100/10T respectively.

Figure 3. Anchor annealing for class-IL with replay. Left: Split-MNIST; right: Split-CIFAR-10. Curves show KAN-CL accuracy versus replay scale ρ for several decay rates δ. Dashed lines indicate KAN+replay and MLP+replay without the anchor.

Figure 4. Component ablation on Permuted-MNIST/10T (left) and CIFAR-100/10T (right). Solid bars: average accuracy; hatched bars: forgetting. Removing the L2 anchor (λ=0) or backbone EWC (λ_b=0) causes the most severe forgetting increase.

Figure 5. Empirical cross-task NTK coupling at initialization. Normalized operator norm for KAN head vs. MLP head across five datasets. KAN is consistently below MLP, with the largest gap on CIFAR-100 (the highest-dimensional feature space).
read the original abstract

Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform per-knot importance-weighted anchoring. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), it reports forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T, while matching or exceeding baseline accuracies. It further provides an NTK analysis claiming that KAN spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime.

Significance. If the empirical results prove robust under full verification and the NTK bound is rigorously established, the work would be significant for continual learning. It demonstrates a compositional approach combining architecture-induced parameter locality (KAN head) with complementary regularization (bbEWC), achieving large forgetting reductions on standard benchmarks. The attempt to derive a bound beyond the lazy-training limit is a notable strength, though it currently rests on a high-level sketch.

major comments (2)
  1. [NTK analysis] NTK analysis section: The central theoretical claim asserts that spline locality creates a structural rank deficit in the cross-task NTK, producing a forgetting bound that holds in the feature-learning regime. Standard NTK theory linearizes around fixed features in the infinite-width lazy limit; the manuscript provides only a sketched derivation and does not explicitly handle dynamic kernel evolution under gradient-driven feature adaptation in finite-width networks. This is load-bearing for the theoretical contribution.
  2. [Empirical evaluation] Empirical evaluation (abstract and §4): The abstract states concrete forgetting reductions (88% on Split-CIFAR-10/5T, 93% on Split-CIFAR-100/10T) and accuracy parity, but the manuscript lacks full details on data splits, number of independent runs, statistical significance tests, or complete hyperparameter tables. Without these, the central empirical claim cannot be verified.
minor comments (2)
  1. [Abstract] Abstract: The term 'bbEWC' is used without prior definition; expand the acronym or define it on first use for clarity.
  2. [Methods] Methods section: Notation for per-knot importance weights and regularization terms should be introduced consistently and cross-referenced to the NTK derivation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications and committing to specific revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [NTK analysis] NTK analysis section: The central theoretical claim asserts that spline locality creates a structural rank deficit in the cross-task NTK, producing a forgetting bound that holds in the feature-learning regime. Standard NTK theory linearizes around fixed features in the infinite-width lazy limit; the manuscript provides only a sketched derivation and does not explicitly handle dynamic kernel evolution under gradient-driven feature adaptation in finite-width networks. This is load-bearing for the theoretical contribution.

    Authors: We appreciate the referee's emphasis on rigor for the NTK analysis. The manuscript's sketch demonstrates that KAN spline compact support induces localized kernel entries, creating a structural rank deficit in the cross-task NTK that limits interference even under feature adaptation. To address the concern about dynamic evolution in finite-width networks, the revised version will expand the section and add an appendix with a more detailed derivation. This will incorporate recent analyses of NTK dynamics beyond the lazy regime (e.g., via gradient flow on finite-width models) to show that the locality-induced rank deficiency persists, thereby making the forgetting bound hold in the feature-learning setting. revision: yes

  2. Referee: [Empirical evaluation] Empirical evaluation (abstract and §4): The abstract states concrete forgetting reductions (88% on Split-CIFAR-10/5T, 93% on Split-CIFAR-100/10T) and accuracy parity, but the manuscript lacks full details on data splits, number of independent runs, statistical significance tests, or complete hyperparameter tables. Without these, the central empirical claim cannot be verified.

    Authors: We agree that full reproducibility details are essential. The revised manuscript will expand §4 (and the appendix) with: (i) exact definitions and preprocessing steps for the Split-CIFAR-10/5T and Split-CIFAR-100/10T splits; (ii) results aggregated over five independent runs with different random seeds, including mean and standard deviation; (iii) statistical significance tests (paired t-tests with reported p-values) comparing KAN-CL against baselines; and (iv) complete hyperparameter tables for all methods, including learning rates, regularization strengths, and KAN-specific spline parameters. These additions will allow independent verification of the reported forgetting reductions and accuracy parity. revision: yes
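
For reference, the paired comparison the authors commit to amounts to a few lines; the sketch below is a generic scipy version, with per-seed forgetting values to be supplied from actual runs (no numbers from the paper are embedded).

```python
from scipy import stats

def paired_seed_test(forgetting_method_a, forgetting_method_b):
    """Paired t-test over matched random seeds.

    Both inputs are sequences of per-seed forgetting values, one entry per
    seed, ordered identically so that entry i of each array comes from the
    same seed. Returns the t statistic and two-sided p-value.
    """
    t_stat, p_value = stats.ttest_rel(forgetting_method_a, forgetting_method_b)
    return t_stat, p_value
```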

Circularity Check

0 steps flagged

No significant circularity; NTK bound derived from spline locality as independent analysis

full rationale

The paper's central theoretical claim is an NTK analysis that starts from KAN spline locality to induce a structural rank deficit in the cross-task NTK and thereby obtain a forgetting bound. This is presented as a derivation rather than a fit to target data or a self-citation chain. Empirical results (88% and 93% forgetting reductions on Split-CIFAR-10/5T and Split-CIFAR-100/10T) are reported as separate experimental outcomes using bbEWC on the backbone plus per-knot regularization on the KAN head; they do not reduce by construction to the NTK equations. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The theoretical analysis is therefore self-contained, with the benchmark results serving as an external check.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard NTK approximation for wide networks and the compact-support property of B-splines; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Neural Tangent Kernel approximation holds for the combined backbone-plus-KAN architecture
    Invoked to derive the cross-task rank deficit and forgetting bound
  • standard math B-splines have strictly local support
    Used to justify per-knot importance weighting
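
As a quick sanity check of this axiom, the sketch below evaluates cubic B-spline basis functions on a clamped knot grid and asserts that each is nonzero only on the few knot intervals it spans; the grid size and spline degree are arbitrary illustrative choices.

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                               # cubic splines, as commonly used in KAN layers
grid = np.linspace(0.0, 1.0, 8)                     # interior grid points
knots = np.r_[[grid[0]] * k, grid, [grid[-1]] * k]  # clamped knot vector
n_basis = len(knots) - k - 1

x = np.linspace(0.0, 1.0, 400)
for j in range(n_basis):
    coeffs = np.zeros(n_basis)
    coeffs[j] = 1.0                                 # activate a single basis function ("knot")
    vals = BSpline(knots, coeffs, k)(x)
    support = (x >= knots[j]) & (x <= knots[j + k + 1])
    # Strictly local support: zero outside k+1 adjacent knot intervals,
    # so an anchor on this coefficient constrains only that local region.
    assert np.all(np.abs(vals[~support]) < 1e-12)
```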

pith-pipeline@v0.9.0 · 5521 in / 1432 out tokens · 36509 ms · 2026-05-13T07:09:19.227345+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 2 internal anchors

  1. [1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2013.
  2. [2] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks. In Psychology of Learning and Motivation, vol. 24, 1989.
  3. [3] R. M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 1999.
  4. [4] J. Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 2017.
  5. [5] J. Schwarz et al. Progress & compress: A scalable framework for continual learning. ICML, 2018.
  6. [6] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. ICML, 2017.
  7. [7] R. Aljundi et al. Memory aware synapses: Learning what (not) to forget. ECCV, 2018.
  8. [8] S.-A. Rebuffi et al. iCaRL: Incremental classifier and representation learning. CVPR, 2017.
  9. [9] A. Chaudhry et al. On tiny episodic memories in continual learning. arXiv:1902.10486, 2019.
  10. [10] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. NeurIPS, 2017.
  11. [11] P. Buzzega et al. Dark experience for general continual learning. NeurIPS, 2020.
  12. [12] Z. Li and D. Hoiem. Learning without forgetting. In ECCV, 2016.
  13. [13] M. Farajtabar et al. Orthogonal gradient descent for continual learning. arXiv:1910.07104, 2020.
  14. [14] A. A. Rusu et al. Progressive neural networks. arXiv:1606.04671, 2016.
  15. [15] A. Mallya and S. Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In CVPR, 2018.
  16. [16] G. M. van de Ven, T. Tuytelaars, and A. S. Tolias. Three types of incremental learning. Nature Machine Intelligence, 4, 2022.
  17. [17] Z. Liu et al. KAN: Kolmogorov–Arnold Networks. arXiv:2404.19756, 2024.
  18. [18] Z. Liu et al. KAN 2.0: Kolmogorov–Arnold Networks meet science. arXiv:2408.10205, 2024.
  19. [19] S. Somvanshi et al. A survey on Kolmogorov–Arnold Networks. arXiv:2411.06078, 2024.
  20. [20] J. Xu et al. Fourier or wavelet bases as counterpart self-attention in deep learning. ICLR, 2024.
  21. [21] R. Genet and M. Inzirillo. TKAN: Temporal Kolmogorov–Arnold Networks. arXiv:2405.07344, 2024.
  22. [22] M. Kiamari, M. Kiamari, and B. Krishnamachari. GKAN: Graph Kolmogorov–Arnold Networks. arXiv:2406.06470, 2024.
  23. [23] S. Qiu et al. PowerMLP: An efficient version of KAN. In AAAI, 2025.
  24. [24] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS, 2018.
  25. [25] J. Lee et al. Wide neural networks of any depth evolve as linear models under gradient descent. NeurIPS, 2019.
  26. [26] T. Doan et al. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. AISTATS, 2021.
  27. [27] M. Bennani, T. Doan, and M. Sugiyama. Generalisation guarantees for continual learning with orthogonal gradient descent. arXiv:2006.11942, 2020.
  28. [28] M. Cheon. Demonstrating the efficacy of Kolmogorov–Arnold Networks in vision tasks. arXiv:2406.14916, 2024.
  29. [29] M. Cheon. Kolmogorov–Arnold Network for satellite image classification in remote sensing. arXiv:2406.00600, 2024.
  30. [30] M. Cheon and C. Mun. Combining KAN with CNN: KonvNeXt's performance in remote sensing and patent insights. Remote Sensing, 16(18):3417, 2024.
  31. [31] S. T. Seydi. Kolmogorov–Arnold Networks: A critical assessment of claims, performance, and practical viability. arXiv:2407.11075, 2024.
  32. [32] A. D. Bodner et al. A preliminary study on continual learning in computer vision using Kolmogorov–Arnold Networks. arXiv:2409.13550, 2024.
  33. [33] X. Liu et al. Rotate your networks: Better weight consolidation and less catastrophic forgetting. ICPR, 2018.
  34. [34] Z. Wang et al. EWC++: Improved Bayesian continual learning with optimized plasticity. arXiv:2302.03535, 2023.