SHARe-KAN: Post-Training Vector Quantization for Cache-Resident KAN Inference
Pith reviewed 2026-05-16 22:53 UTC · model grok-4.3
The pith
SHARe-KAN compresses pre-trained KAN prediction heads 9.3X via post-training vector quantization of spline coefficients, at a cost of 2.0 mAP points on PASCAL VOC detection and with no retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pre-trained Vision Kolmogorov-Arnold Networks store dense B-spline grids on every edge that inflate prediction-head parameter counts by more than 140X relative to a comparable MLP. SHARe-KAN performs post-training vector quantization through a Gain-Shape-Bias decomposition and a layer-shared codebook, paired with an ExecuTorch runtime that keeps the codebook resident in on-chip L2. On PASCAL VOC with a ResNet-50 backbone this yields 9.3X compression of the prediction head (6.32 MB versus 58.67 MB) at a 2.0-point mAP cost with no retraining; zero-shot transfer to COCO retains 88.9 percent of the dense baseline, and scaling to 50 heads reduces storage from 2.9 GB to 211 MB.
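The headline ratios can be re-derived from the sizes quoted above. The sketch below is plain arithmetic on the reported figures. The final step, which splits the compressed size into a shared part and a per-head part, rests on an assumed linear storage model that the paper summary does not state, so treat those two numbers as illustrative only.

```python
# Sizes reported for the prediction head, in MB.
DENSE_HEAD_MB = 58.67
SHARE_INT8_HEAD_MB = 6.32
SHARE_INT8_50_HEADS_MB = 211.0
NUM_HEADS = 50

# Single-head compression ratio.
single_head_ratio = DENSE_HEAD_MB / SHARE_INT8_HEAD_MB

# Dense storage grows linearly with the number of heads.
dense_50_heads_mb = DENSE_HEAD_MB * NUM_HEADS
multi_head_ratio = dense_50_heads_mb / SHARE_INT8_50_HEADS_MB

# ASSUMPTION (not from the paper): compressed size = shared_mb + per_head_mb * n_heads.
# Solve that model from the 1-head and 50-head figures.
per_head_mb = (SHARE_INT8_50_HEADS_MB - SHARE_INT8_HEAD_MB) / (NUM_HEADS - 1)
shared_mb = SHARE_INT8_HEAD_MB - per_head_mb

print(f"single head: {single_head_ratio:.1f}X")
print(f"dense at 50 heads: {dense_50_heads_mb / 1000:.1f} GB")
print(f"50 heads: {multi_head_ratio:.1f}X")
print(f"implied shared part: {shared_mb:.2f} MB, per head: {per_head_mb:.2f} MB")
```

The first three lines reproduce the 9.3X, 2.9 GB, and 13.9X figures. The ratio improves with more heads because 50 independent copies of the 6.32 MB head would take 316 MB, more than the reported 211 MB, which is consistent with some storage being amortized across heads.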
What carries the argument
Gain-Shape-Bias decomposition with a layer-shared codebook that clusters and quantizes pre-trained B-spline coefficients into a compact representation mapped to on-chip L2 cache.
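A minimal NumPy sketch of what such a decomposition could look like follows. Everything in it is an assumption made for illustration: the bias is taken as the mean of each edge's coefficient vector, the gain as the L2 norm of the centered vector, the shape as the resulting unit vector, and the codebook comes from plain Lloyd's k-means over shapes pooled from all layers. The function names, the 13 coefficients per edge, and the codebook size are hypothetical, and the paper may define and scope the codebook differently.

```python
import numpy as np

def gsb_decompose(coeffs, eps=1e-12):
    """Split each row into (gain, unit-norm shape, bias). Assumed definition."""
    bias = coeffs.mean(axis=1, keepdims=True)
    centered = coeffs - bias
    gain = np.linalg.norm(centered, axis=1, keepdims=True)
    shape = centered / np.maximum(gain, eps)
    return gain, shape, bias

def nearest_codeword(shapes, codebook):
    """Index of the closest codeword (squared Euclidean distance) per shape."""
    d2 = ((shapes[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def fit_shared_codebook(shapes, k, iters=20, seed=0):
    """Plain Lloyd's k-means on shape vectors, initialized from random rows."""
    rng = np.random.default_rng(seed)
    codebook = shapes[rng.choice(len(shapes), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = nearest_codeword(shapes, codebook)
        for j in range(k):
            members = shapes[idx == j]
            if len(members):  # keep the old codeword if a cluster is empty
                codebook[j] = members.mean(axis=0)
    return codebook

def reconstruct(gain, idx, bias, codebook):
    """Approximate coefficients as gain * codeword + bias."""
    return gain * codebook[idx] + bias

# Synthetic stand-in for three layers of spline coefficients (one row per edge).
rng = np.random.default_rng(0)
layers = [rng.normal(size=(256, 13)) for _ in range(3)]
parts = [gsb_decompose(c) for c in layers]
all_shapes = np.concatenate([shape for _, shape, _ in parts])
codebook = fit_shared_codebook(all_shapes, k=32)  # one codebook for all layers

for coeffs, (gain, shape, bias) in zip(layers, parts):
    idx = nearest_codeword(shape, codebook)
    approx = reconstruct(gain, idx, bias, codebook)
    rel_err = np.linalg.norm(approx - coeffs) / np.linalg.norm(coeffs)
    print(f"relative reconstruction error: {rel_err:.3f}")
```

Under these assumptions each edge stores only a codeword index, a gain, and a bias, while the shape vectors live once in the shared codebook. Random Gaussian rows have no shared structure, so the error printed here says nothing about the error on real pre-trained coefficients.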
If this is right
- KAN prediction heads become small enough for direct deployment on memory-limited edge accelerators without retraining.
- At fifty task heads the method reduces total storage from 2.9 GB to 211 MB, enabling multi-expert KAN systems on current edge silicon.
- Most accuracy loss originates in the clustering step; moving from FP32 to Int8 costs only 1.3 further retention points.
- Zero-shot transfer across datasets remains viable, with the COCO gap attributable mainly to the quantization itself.
- The runtime mapping to L2 cache keeps inference in a cache-resident regime rather than memory-bound.
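The FP32-to-Int8 step mentioned in the list above can be illustrated with a generic symmetric per-tensor quantizer. This is an assumed scheme chosen for simplicity, not the one the paper necessarily uses, and the codebook shape below is hypothetical. The sketch shows only the storage saving and the rounding-error bound, not the effect on mAP.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x is approximated by scale * q."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map Int8 codes back to floating point."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
codebook_fp32 = rng.normal(size=(256, 13)).astype(np.float32)  # hypothetical shape

q, scale = quantize_int8(codebook_fp32)
restored = dequantize_int8(q, scale)
max_err = float(np.abs(restored - codebook_fp32).max())

print(codebook_fp32.nbytes, "bytes ->", q.nbytes, "bytes")
print("max abs error:", max_err, "rounding bound:", scale / 2)
```

Int8 storage is a quarter of FP32 for the same tensor, and with this scheme the per-element error is bounded by half the quantization step, up to floating-point rounding.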
Where Pith is reading between the lines
- The shared-codebook design could extend to other grid- or spline-heavy architectures beyond KANs for similar post-training compression.
- Combining the quantization with lightweight task-specific calibration might close the remaining accuracy gap without full retraining.
- At larger numbers of heads the storage savings would compound further, potentially allowing hundreds of KAN experts within the same memory budget.
- The approach separates the cost of the quantization step from the cost of Int8 representation, suggesting independent levers for future accuracy recovery.
Load-bearing premise
A single codebook obtained by clustering the pre-trained spline coefficients on one task will preserve accuracy on new domains and additional task heads without any fine-tuning.
What would settle it
Apply the same Int8 codebook to a new detection task or domain and measure whether the mAP drop exceeds the reported 2-point in-domain loss or the 11.1 percent retention gap observed on COCO.
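The test described above reduces to two comparisons. The sketch below encodes them with hypothetical helper names; the thresholds default to the figures reported for PASCAL VOC (2.0 mAP points) and COCO (11.1 percent retention gap), and the example inputs in the last lines are made up for illustration.

```python
def retention(compressed_map, dense_map):
    """Percentage of the dense baseline's mAP kept by the compressed model."""
    return 100.0 * compressed_map / dense_map

def exceeds_reported_loss(dense_map, compressed_map, in_domain,
                          max_point_drop=2.0, max_retention_gap=11.1):
    """True if a new measurement is worse than the losses reported in the paper."""
    if in_domain:
        return (dense_map - compressed_map) > max_point_drop
    return (100.0 - retention(compressed_map, dense_map)) > max_retention_gap

# In-domain PASCAL VOC figures from the abstract.
print(f"{retention(80.22, 82.22):.1f}% retained in-domain")

# Hypothetical new-domain measurement: dense 40.0 mAP, compressed 35.0 mAP.
print(exceeds_reported_loss(40.0, 35.0, in_domain=False))
```

Expressed as retention, the in-domain result keeps about 97.6 percent of the dense mAP, against 88.9 percent on COCO, which is the gap the premise has to explain.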
Original abstract
Pre-trained Vision Kolmogorov-Arnold Networks (KANs) store a dense B-spline grid on every edge, inflating prediction-head parameter counts by more than 140X relative to a comparable MLP and pushing inference into a memory-bound regime on edge accelerators. Standard magnitude pruning fails on these pre-trained models: zero-shot sparsity collapses accuracy, and restoring it requires an iterative fine-tuning loop that is impractical in deployment settings. We present SHARe-KAN, a post-training compiler that compresses spline coefficients via a Gain-Shape-Bias decomposition with a layer-shared codebook, paired with LUTHAM, an ExecuTorch runtime that maps the codebook into on-chip L2. On PASCAL VOC detection with a ResNet-50 backbone, SHARe-KAN Int8 reaches 9.3X storage compression over the Dense KAN baseline (6.32 MB vs. 58.67 MB prediction head) at a 2.0 point in-domain accuracy cost (80.22% vs. 82.22% mAP), with no retraining. Zero-shot transfer to COCO retains 88.9% of the Dense KAN mAP; most of this gap comes from the VQ clustering step itself, and further quantization from FP32 to Int8 costs only 1.3 retention points. The value of the approach compounds at scale: at 50 task heads, Dense KAN prediction-head storage reaches 2.9 GB while SHARe-KAN Int8 requires 211 MB, a 13.9X reduction that brings multi-expert KAN deployment within the memory budgets of contemporary edge silicon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SHARe-KAN, a post-training vector quantization method for pre-trained Vision KANs that applies a Gain-Shape-Bias decomposition to spline coefficients using a single layer-shared codebook, paired with the LUTHAM ExecuTorch runtime for L2-cache-resident inference. On PASCAL VOC detection with ResNet-50, it reports 9.3X compression of the prediction head (6.32 MB vs. 58.67 MB) at a 2.0-point mAP cost (80.22% vs. 82.22%) with no retraining; zero-shot COCO transfer retains 88.9% of dense KAN mAP, with most loss from the VQ step itself. At 50 task heads the method yields a 13.9X storage reduction.
Significance. If the empirical results hold under broader validation, the work meaningfully lowers the memory barrier that currently prevents KANs from edge deployment, particularly for multi-expert or multi-task settings where dense spline grids exceed on-chip budgets. The post-training, no-retraining design and concrete scaling numbers at 50 heads are practical strengths.
major comments (3)
- [§4.2] §4.2 (Codebook Construction): the decision to use a single layer-shared codebook obtained by k-means on one task's pre-trained coefficients is load-bearing for the cross-domain claim, yet the manuscript provides no quantitative comparison of coefficient distributions (scale, kurtosis, or support) across layers or between PASCAL VOC and COCO; without this, the 11.1% zero-shot mAP drop cannot be confidently attributed solely to VQ rather than irreversible loss of task-specific statistics.
- [Table 2] Table 2 and §5.1: the reported mAP figures (80.22%, 88.9% retention) lack error bars, standard deviations, or results over multiple clustering seeds; given that codebook size and bit-width are free parameters, the 2.0-point in-domain drop and the claim that further Int8 quantization costs only 1.3 retention points cannot be assessed for statistical robustness.
- [§5.3] §5.3 (Scaling Experiment): the 13.9X reduction at 50 heads assumes the same shared codebook generalizes without per-task adaptation; an ablation showing how approximation error grows with the number of distinct coefficient distributions would directly test whether the method remains viable beyond the two-task regime reported.
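The seed-sensitivity concern raised in the comments above can be prototyped cheaply on reconstruction error before any mAP evaluation. The sketch below is an assumed protocol, not the paper's: it runs a basic k-means with several seeds on synthetic data and reports the mean and sample standard deviation of the quantization error. Function names and sizes are hypothetical.

```python
import numpy as np

def kmeans_mse(data, k, seed, iters=15):
    """Run Lloyd's k-means from one seed; return mean squared quantization error."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        for j in range(k):
            members = data[idx == j]
            if len(members):  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

def seed_sweep(data, k, seeds):
    """Mean and sample standard deviation of the error across seeds."""
    errors = np.array([kmeans_mse(data, k, s) for s in seeds])
    return float(errors.mean()), float(errors.std(ddof=1))

# Synthetic stand-in for shape vectors; real coefficients would replace this.
data = np.random.default_rng(0).normal(size=(500, 13))
mean_err, std_err = seed_sweep(data, k=32, seeds=range(10))
print(f"MSE over 10 seeds: {mean_err:.4f} +/- {std_err:.4f}")
```

A small spread in reconstruction error would not by itself guarantee a small spread in mAP, so this is a screening step rather than a substitute for the multi-seed mAP table the referee asks for.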
minor comments (3)
- [Figure 4] Figure 4: axis labels and legend do not clearly distinguish the dense KAN, SHARe-KAN FP32, and Int8 variants; the compression ratios are hard to read at a glance.
- [§3.1] §3.1: the Gain-Shape-Bias decomposition is introduced without an explicit equation for the reconstructed coefficient; adding the reconstruction formula would clarify how the shared codebook is applied at inference.
- [Related Work] Related Work: the discussion of prior post-training quantization for MLPs and transformers is present but does not cite recent KAN-specific compression attempts; a short sentence situating SHARe-KAN relative to them would help.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
- Referee: [§4.2] §4.2 (Codebook Construction): the decision to use a single layer-shared codebook obtained by k-means on one task's pre-trained coefficients is load-bearing for the cross-domain claim, yet the manuscript provides no quantitative comparison of coefficient distributions (scale, kurtosis, or support) across layers or between PASCAL VOC and COCO; without this, the 11.1% zero-shot mAP drop cannot be confidently attributed solely to VQ rather than irreversible loss of task-specific statistics.
  Authors: We agree that a quantitative comparison of coefficient statistics would strengthen attribution of the observed drop. In the revised manuscript we will add a supplementary table reporting mean, standard deviation, kurtosis, and support range of the spline coefficients for each layer on PASCAL VOC and for the corresponding layers on COCO. This will allow readers to assess the similarity that justifies the shared codebook.
  Revision: yes
- Referee: [Table 2] Table 2 and §5.1: the reported mAP figures (80.22%, 88.9% retention) lack error bars, standard deviations, or results over multiple clustering seeds; given that codebook size and bit-width are free parameters, the 2.0-point in-domain drop and the claim that further Int8 quantization costs only 1.3 retention points cannot be assessed for statistical robustness.
  Authors: The referee is correct that variability due to k-means initialization should be quantified. We will rerun the codebook construction with 10 different random seeds, recompute the mAP values, and report means together with standard deviations in the updated Table 2 and the corresponding paragraphs of §5.1.
  Revision: yes
- Referee: [§5.3] §5.3 (Scaling Experiment): the 13.9X reduction at 50 heads assumes the same shared codebook generalizes without per-task adaptation; an ablation showing how approximation error grows with the number of distinct coefficient distributions would directly test whether the method remains viable beyond the two-task regime reported.
  Authors: We acknowledge that a systematic ablation of approximation error versus number of distinct tasks would provide stronger evidence of scalability. Performing a full multi-task ablation with 50 independent coefficient distributions is computationally expensive and was outside the scope of the original experiments. In the revision we will add a brief discussion of the error growth observed when the shared codebook is applied to the 50-head setting and will explicitly list a controlled multi-distribution ablation as future work.
  Revision: partial
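The per-layer statistics discussed in the first response (mean, standard deviation, kurtosis, support range) are straightforward to compute. The sketch below shows one way to do it with NumPy; the layer names and the synthetic coefficient arrays are hypothetical placeholders, and excess kurtosis is one of several conventions that could be reported.

```python
import numpy as np

def coefficient_stats(coeffs):
    """Summary statistics for one layer's spline coefficients."""
    flat = np.asarray(coeffs, dtype=np.float64).ravel()
    mean = flat.mean()
    std = flat.std()
    if std > 0:
        # Excess kurtosis: 0 for a Gaussian, positive for heavier tails.
        kurt = ((flat - mean) ** 4).mean() / std ** 4 - 3.0
    else:
        kurt = float("nan")
    return {
        "mean": float(mean),
        "std": float(std),
        "kurtosis": float(kurt),
        "min": float(flat.min()),
        "max": float(flat.max()),
    }

# Synthetic stand-ins for two layers with different tail behavior.
rng = np.random.default_rng(0)
layers = {
    "head.layer0": rng.normal(size=(256, 13)),
    "head.layer1": rng.laplace(size=(256, 13)),
}
for name, coeffs in layers.items():
    stats = coefficient_stats(coeffs)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Running the same function on matching layers from two datasets gives the side-by-side comparison the referee requests; large differences in scale or kurtosis would be a warning sign for a codebook fitted on only one of them.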
Circularity Check
Post-training VQ procedure is self-contained with no circular reductions
The paper presents SHARe-KAN as a post-training compiler that applies Gain-Shape-Bias decomposition and k-means clustering to obtain a layer-shared codebook from pre-trained spline coefficients, followed by Int8 quantization. Reported metrics (9.3X compression, 80.22% mAP on VOC, 88.9% retention on COCO) are obtained by direct measurement on held-out data with no retraining. No equations in the manuscript reduce the compression ratio or accuracy figures to a fitted parameter or self-defined quantity inside the paper. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core pipeline; the method is an empirical procedure whose outputs are independently verifiable against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- codebook size
- bit width
axioms (1)
- domain assumption: vector quantization of shape vectors preserves sufficient information for downstream task accuracy
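To see how the two free parameters drive storage, the sketch below uses a generic vector-quantization cost model: one shared codebook plus a fixed-width index and some side information per vector. The model, the function name, and every number in the example are assumptions for illustration; none of them are taken from the paper.

```python
import math

def vq_storage_bytes(num_vectors, dim, codebook_size, bits, side_bits=0):
    """Bytes for one shared codebook plus per-vector indices and side info.

    side_bits is the per-vector side information in bits (for example a
    quantized gain and bias); it is an assumed part of the model.
    """
    codebook_bytes = codebook_size * dim * bits / 8
    index_bits = math.ceil(math.log2(codebook_size))
    per_vector_bytes = num_vectors * (index_bits + side_bits) / 8
    return codebook_bytes + per_vector_bytes

# Hypothetical head: one million edges, 13 coefficients each, stored as FP32.
num_vectors, dim = 1_000_000, 13
dense_bytes = num_vectors * dim * 4

for codebook_size in (256, 4096, 65536):
    compressed = vq_storage_bytes(num_vectors, dim, codebook_size,
                                  bits=8, side_bits=16)
    print(codebook_size, f"{dense_bytes / compressed:.1f}X")
```

In this model a larger codebook lowers approximation error but widens every index, while the bit width only affects the codebook itself, so the two parameters trade off storage against accuracy in different places.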
A Additional Experimental Details
A.1 Hyperparameters
All KAN models use the following configuration:
- Spline basis: Cubic B-splines (k = 3)
- Grid size: G = 10 point...