arxiv: 2512.15742 · v2 · submitted 2025-12-10 · 💻 cs.LG · cs.DC

SHARe-KAN: Post-Training Vector Quantization for Cache-Resident KAN Inference

Jeff Smith

Pith reviewed 2026-05-16 22:53 UTC · model grok-4.3

classification 💻 cs.LG cs.DC

keywords Kolmogorov-Arnold Networks · Vector Quantization · Post-Training Compression · Edge Inference · B-spline Grids · Object Detection · Model Compression · Cache Optimization

The pith

SHARe-KAN compresses pre-trained KAN prediction heads 9.3 times via post-training vector quantization on spline coefficients while dropping only 2 points of accuracy on detection tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pre-trained KANs store dense B-spline grids on every edge, inflating prediction-head parameters by more than 140 times compared to an MLP and forcing memory-bound inference on edge hardware. Standard pruning fails on these models without expensive retraining loops. SHARe-KAN applies a Gain-Shape-Bias decomposition with a single layer-shared codebook to quantize the coefficients after training, then maps the compact codebook into on-chip L2 cache through a custom runtime. On PASCAL VOC detection the method delivers 9.3X storage reduction at a 2-point mAP cost with no retraining, and the same weights retain 88.9 percent of original accuracy on COCO zero-shot transfer. At 50 task heads the approach shrinks total storage from 2.9 GB to 211 MB, bringing multi-head KAN deployment inside the memory limits of contemporary edge silicon.

Core claim

Pre-trained Vision Kolmogorov-Arnold Networks store dense B-spline grids on every edge that inflate prediction-head parameter counts by more than 140X relative to a comparable MLP. SHARe-KAN performs post-training vector quantization through a Gain-Shape-Bias decomposition and a layer-shared codebook, paired with an ExecuTorch runtime that keeps the codebook resident in on-chip L2. On PASCAL VOC with a ResNet-50 backbone this yields 9.3X compression of the prediction head (6.32 MB versus 58.67 MB) at a 2.0-point mAP cost with no retraining; zero-shot transfer to COCO retains 88.9 percent of the dense baseline, and scaling to 50 heads reduces storage from 2.9 GB to 211 MB.
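The headline ratios can be reproduced from the figures quoted on this page. The sketch below is a plain arithmetic check, not code from the paper; it assumes that dense storage scales linearly with the number of heads and that 1 GB = 1000 MB, neither of which the page states explicitly.

```python
# Figures quoted on this page (sizes in MB, accuracy in mAP percent).
DENSE_HEAD_MB = 58.67
SHARE_HEAD_MB = 6.32
DENSE_MAP = 82.22
SHARE_MAP = 80.22
N_HEADS = 50
SHARE_50_HEADS_MB = 211.0

# Single-head compression ratio and in-domain accuracy cost.
single_head_ratio = DENSE_HEAD_MB / SHARE_HEAD_MB
map_cost_points = DENSE_MAP - SHARE_MAP

# Assumption: dense storage grows linearly with the number of heads.
dense_50_heads_mb = N_HEADS * DENSE_HEAD_MB
multi_head_ratio = dense_50_heads_mb / SHARE_50_HEADS_MB

# Average compressed footprint per head in the 50-head setting.
share_mb_per_head = SHARE_50_HEADS_MB / N_HEADS

print(f"single head: {single_head_ratio:.1f}X at {map_cost_points:.1f} mAP points")
print(f"dense, 50 heads: {dense_50_heads_mb / 1000:.1f} GB")
print(f"50 heads: {multi_head_ratio:.1f}X, {share_mb_per_head:.2f} MB per head")
```

Under these assumptions the ratios come out at roughly 9.3X for one head and 13.9X for 50 heads, matching the quoted figures. The average of about 4.22 MB per head at 50 heads is below the 6.32 MB single-head size, which would be consistent with some storage being amortized across heads, though the page does not break this down.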

What carries the argument

Gain-Shape-Bias decomposition with a layer-shared codebook that clusters and quantizes pre-trained B-spline coefficients into a compact representation mapped to on-chip L2 cache.
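A minimal sketch of how such a pipeline might look follows. The page does not give the paper's exact decomposition or reconstruction formula, so everything here is an assumption made for illustration: bias is taken as the per-edge mean, gain as the norm of the centered coefficients, shape as the resulting unit vector, and the codebook comes from a plain Lloyd's k-means over the shapes. The function names, the 13 coefficients per edge, and the codebook size of 64 are all hypothetical.

```python
import numpy as np

def gain_shape_bias(coeffs):
    """Assumed split: bias = row mean, gain = norm of centered row, shape = unit vector."""
    bias = coeffs.mean(axis=1, keepdims=True)
    centered = coeffs - bias
    gain = np.linalg.norm(centered, axis=1, keepdims=True)
    shape = centered / np.maximum(gain, 1e-12)  # guard against zero-gain rows
    return gain, shape, bias

def nearest(x, centroids):
    """Index of the closest centroid for every row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the codebook and final assignments."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = nearest(x, centroids)
        for j in range(k):
            members = x[idx == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, nearest(x, centroids)

def reconstruct(gain, idx, bias, codebook):
    """Assumed reconstruction: per-edge gain times shared shape entry, plus bias."""
    return gain * codebook[idx] + bias

# Synthetic stand-in for one layer: 2000 edges, 13 spline coefficients each.
rng = np.random.default_rng(0)
coeffs = rng.normal(size=(2000, 13))

gain, shape, bias = gain_shape_bias(coeffs)
codebook, idx = kmeans(shape, k=64)  # one codebook shared by every edge
approx = reconstruct(gain, idx, bias, codebook)
rel_err = np.linalg.norm(coeffs - approx) / np.linalg.norm(coeffs)
print(f"relative reconstruction error: {rel_err:.3f}")
```

In this sketch each edge stores only a gain, a bias, and an index into the shared codebook, which is where the storage saving would come from. Random Gaussian data has no cluster structure, so the error printed here says nothing about the accuracy reported for real pre-trained coefficients.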

If this is right

KAN prediction heads become small enough for direct deployment on memory-limited edge accelerators without retraining.
At fifty task heads the method reduces total storage from 2.9 GB to 211 MB, enabling multi-expert KAN systems on current edge silicon.
Most accuracy loss originates in the clustering step; moving from FP32 to Int8 costs only 1.3 further retention points.
Zero-shot transfer across datasets remains viable, with the COCO gap attributable mainly to the quantization itself.
The runtime mapping to L2 cache keeps inference in a cache-resident regime rather than memory-bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shared-codebook design could extend to other grid- or spline-heavy architectures beyond KANs for similar post-training compression.
Combining the quantization with lightweight task-specific calibration might close the remaining accuracy gap without full retraining.
At larger numbers of heads the storage savings would compound further, potentially allowing hundreds of KAN experts within the same memory budget.
The approach separates the cost of the quantization step from the cost of Int8 representation, suggesting independent levers for future accuracy recovery.

Load-bearing premise

A single codebook obtained by clustering the pre-trained spline coefficients on one task will preserve accuracy on new domains and additional task heads without any fine-tuning.

What would settle it

Apply the same Int8 codebook to a new detection task or domain and measure whether the mAP drop exceeds the reported 2-point in-domain loss or the 11.1 percent retention gap observed on COCO.
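The test described above reduces to two comparisons, sketched here as a small helper. The thresholds are the figures quoted on this page; the function names and the new-domain numbers in the example are hypothetical placeholders, not measurements.

```python
def retention_pct(compressed_map, dense_map):
    """Compressed accuracy as a percentage of the dense baseline."""
    return 100.0 * compressed_map / dense_map

def exceeds_reported_loss(dense_map, compressed_map, in_domain,
                          max_point_drop=2.0, min_retention_pct=88.9):
    """True if the observed loss is worse than the figures reported for the paper."""
    if in_domain:
        return (dense_map - compressed_map) > max_point_drop
    return retention_pct(compressed_map, dense_map) < min_retention_pct

# In-domain figures quoted on this page (PASCAL VOC, mAP percent).
voc_worse = exceeds_reported_loss(82.22, 80.22, in_domain=True)

# Placeholder numbers for a new domain, made up purely to exercise the check.
new_domain_worse = exceeds_reported_loss(40.0, 34.0, in_domain=False)

print(voc_worse, new_domain_worse)
```

A single comparison like this ignores run-to-run variance, so in practice the thresholds would need a tolerance derived from repeated runs.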

Figures

Figures reproduced from arXiv: 2512.15742 by Jeff Smith.

**Figure 1.** The pruning cliff. Vision KANs suffer catastrophic performance collapse under magnitude-based pruning, contrasting with the gradual degradation of standard MLPs, indicating information is distributed rather than localized.
Adjacent text in the source (§3.1, The Pruning Cliff): "We trained a KAN-based object detection head (ResNet-50 [8] backbone, SSD-style output) on PASCAL VOC, achieving 85.23% mAP with 223 MB parameters. Following magn…"

**Figure 2.** Compression vs. accuracy trade-off. SHARe-KAN (Int8) achieves competitive accuracy with 17× smaller model size than Dense KAN, approaching ResNet-50 MLP performance in a 12.91 MB footprint.

**Figure 3.** VQ saturation. Reconstruction quality (R²) reaches saturation at K = 65,536, justifying 16-bit index allocation.
Adjacent text in the source: "… the DRAM speed limit, demonstrating that the VQ codebook (12.9 MB) resides in the 40 MB L2 cache throughout inference. L2 residency decouples compute throughput from off-chip memory bandwidth, confirming Vector Quantization successfully moved the workload from Memory-Bound (DRAM) to Cache-Bound…"
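The 16-bit index width quoted with K = 65,536 follows from needing ceil(log2 K) bits to address K codebook entries. The sketch below works this out with integer arithmetic; the helper names and the count of one million indexed vectors are hypothetical, chosen only to show the resulting index storage.

```python
def index_bits(codebook_size):
    """Bits needed to address every entry of a codebook of the given size."""
    if codebook_size < 1:
        raise ValueError("codebook_size must be positive")
    return max(1, (codebook_size - 1).bit_length())

def index_bytes(num_vectors, codebook_size):
    """Bytes needed to store one tightly packed index per quantized vector."""
    total_bits = num_vectors * index_bits(codebook_size)
    return (total_bits + 7) // 8

K = 65_536
n_vectors = 1_000_000  # hypothetical number of quantized coefficient vectors

print(f"K = {K}: {index_bits(K)} bits per index")
print(f"{n_vectors} indices: {index_bytes(n_vectors, K)} bytes")
```

A codebook of exactly 2^16 entries uses the full range of a 16-bit index, so no index bits are wasted; one more entry would push the width to 17 bits.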

Original abstract

Pre-trained Vision Kolmogorov-Arnold Networks (KANs) store a dense B-spline grid on every edge, inflating prediction-head parameter counts by more than 140X relative to a comparable MLP and pushing inference into a memory-bound regime on edge accelerators. Standard magnitude pruning fails on these pre-trained models: zero-shot sparsity collapses accuracy, and restoring it requires an iterative fine-tuning loop that is impractical in deployment settings. We present SHARe-KAN, a post-training compiler that compresses spline coefficients via a Gain-Shape-Bias decomposition with a layer-shared codebook, paired with LUTHAM, an ExecuTorch runtime that maps the codebook into on-chip L2. On PASCAL VOC detection with a ResNet-50 backbone, SHARe-KAN Int8 reaches 9.3X storage compression over the Dense KAN baseline (6.32 MB vs. 58.67 MB prediction head) at a 2.0 point in-domain accuracy cost (80.22% vs. 82.22% mAP), with no retraining. Zero-shot transfer to COCO retains 88.9% of the Dense KAN mAP; most of this gap comes from the VQ clustering step itself, and further quantization from FP32 to Int8 costs only 1.3 retention points. The value of the approach compounds at scale: at 50 task heads, Dense KAN prediction-head storage reaches 2.9 GB while SHARe-KAN Int8 requires 211 MB, a 13.9X reduction that brings multi-expert KAN deployment within the memory budgets of contemporary edge silicon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SHARe-KAN gives a workable post-training compression path for KANs but the shared codebook needs more scrutiny on generalization.

SHARe-KAN shows a practical post-training way to compress KAN prediction heads by roughly 9x with a small accuracy penalty, using vector quantization on the spline coefficients. The new element is the Gain-Shape-Bias decomposition paired with a single layer-shared codebook. This setup is tailored to KANs and does not come straight from MLP quantization work. The reported results are straightforward: on PASCAL VOC detection the Int8 model uses 6.32 MB instead of 58.67 MB and loses 2 mAP points, while zero-shot transfer to COCO keeps 88.9 percent of the dense performance. The multi-head scaling numbers also look helpful for real deployment.

The main concern is whether the shared codebook holds up when the coefficient distributions change across tasks or layers. The COCO drop is already noticeable, and most of it traces back to the clustering step. Without ablations on codebook size or checks for per-layer variation, it is unclear how much task-specific information gets lost in the fixed Int8 mapping.

This work is aimed at engineers who want to run KANs on edge devices without fine-tuning loops. Readers focused on efficient inference for spline networks will find the concrete compression ratios and runtime mapping useful. The empirical grounding is strong enough on the tested cases that the paper deserves a serious referee. I would recommend sending it to peer review, with the expectation that reviewers will ask for more tests on domain shift and codebook sensitivity.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces SHARe-KAN, a post-training vector quantization method for pre-trained Vision KANs that applies a Gain-Shape-Bias decomposition to spline coefficients using a single layer-shared codebook, paired with the LUTHAM ExecuTorch runtime for L2-cache-resident inference. On PASCAL VOC detection with ResNet-50, it reports 9.3X compression of the prediction head (6.32 MB vs. 58.67 MB) at a 2.0-point mAP cost (80.22% vs. 82.22%) with no retraining; zero-shot COCO transfer retains 88.9% of dense KAN mAP, with most loss from the VQ step itself. At 50 task heads the method yields a 13.9X storage reduction.

Significance. If the empirical results hold under broader validation, the work meaningfully lowers the memory barrier that currently prevents KANs from edge deployment, particularly for multi-expert or multi-task settings where dense spline grids exceed on-chip budgets. The post-training, no-retraining design and concrete scaling numbers at 50 heads are practical strengths.

major comments (3)

[§4.2, Codebook Construction] The decision to use a single layer-shared codebook obtained by k-means on one task's pre-trained coefficients is load-bearing for the cross-domain claim, yet the manuscript provides no quantitative comparison of coefficient distributions (scale, kurtosis, or support) across layers or between PASCAL VOC and COCO; without this, the 11.1% zero-shot mAP drop cannot be confidently attributed solely to VQ rather than irreversible loss of task-specific statistics.
[Table 2, §5.1] The reported mAP figures (80.22%, 88.9% retention) lack error bars, standard deviations, or results over multiple clustering seeds; given that codebook size and bit-width are free parameters, the 2.0-point in-domain drop and the claim that further Int8 quantization costs only 1.3 retention points cannot be assessed for statistical robustness.
[§5.3, Scaling Experiment] The 13.9X reduction at 50 heads assumes the same shared codebook generalizes without per-task adaptation; an ablation showing how approximation error grows with the number of distinct coefficient distributions would directly test whether the method remains viable beyond the two-task regime reported.
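The distribution comparison requested in the first major comment could be as simple as a per-layer summary table. The sketch below computes scale, excess kurtosis, and support with NumPy on synthetic stand-in data; the layer names, array shapes, and the two distributions are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def coefficient_stats(coeffs):
    """Summary statistics for one layer's spline coefficients."""
    x = np.asarray(coeffs, dtype=np.float64).ravel()
    mean = x.mean()
    std = x.std()
    # Excess kurtosis: fourth standardized moment minus 3 (0 for a Gaussian).
    kurt = np.mean((x - mean) ** 4) / std ** 4 - 3.0
    return {
        "mean": float(mean),
        "std": float(std),
        "excess_kurtosis": float(kurt),
        "support": (float(x.min()), float(x.max())),
    }

# Synthetic stand-ins: one light-tailed layer and one heavy-tailed layer.
rng = np.random.default_rng(0)
layers = {
    "layer_a": rng.normal(0.0, 1.0, size=(20000, 13)),
    "layer_b": rng.laplace(0.0, 0.5, size=(20000, 13)),
}

stats = {name: coefficient_stats(c) for name, c in layers.items()}
for name, s in stats.items():
    lo, hi = s["support"]
    print(f"{name}: std={s['std']:.3f} kurt={s['excess_kurtosis']:.2f} "
          f"support=[{lo:.2f}, {hi:.2f}]")
```

Layers or tasks whose rows in such a table differ sharply would be the natural candidates for checking whether one shared codebook still fits.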

minor comments (3)

[Figure 4] Axis labels and legend do not clearly distinguish the dense KAN, SHARe-KAN FP32, and Int8 variants; the compression ratios are hard to read at a glance.
[§3.1] The Gain-Shape-Bias decomposition is introduced without an explicit equation for the reconstructed coefficient; adding the reconstruction formula would clarify how the shared codebook is applied at inference.
[Related Work] The discussion of prior post-training quantization for MLPs and transformers is present but does not cite recent KAN-specific compression attempts; a short sentence situating SHARe-KAN relative to them would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: [§4.2, Codebook Construction] The decision to use a single layer-shared codebook obtained by k-means on one task's pre-trained coefficients is load-bearing for the cross-domain claim, yet the manuscript provides no quantitative comparison of coefficient distributions (scale, kurtosis, or support) across layers or between PASCAL VOC and COCO; without this, the 11.1% zero-shot mAP drop cannot be confidently attributed solely to VQ rather than irreversible loss of task-specific statistics.

Authors: We agree that a quantitative comparison of coefficient statistics would strengthen attribution of the observed drop. In the revised manuscript we will add a supplementary table reporting mean, standard deviation, kurtosis, and support range of the spline coefficients for each layer on PASCAL VOC and for the corresponding layers on COCO. This will allow readers to assess the similarity that justifies the shared codebook.
Revision: yes

Referee: [Table 2, §5.1] The reported mAP figures (80.22%, 88.9% retention) lack error bars, standard deviations, or results over multiple clustering seeds; given that codebook size and bit-width are free parameters, the 2.0-point in-domain drop and the claim that further Int8 quantization costs only 1.3 retention points cannot be assessed for statistical robustness.

Authors: The referee is correct that variability due to k-means initialization should be quantified. We will rerun the codebook construction with 10 different random seeds, recompute the mAP values, and report means together with standard deviations in the updated Table 2 and the corresponding paragraphs of §5.1.
Revision: yes

Referee: [§5.3, Scaling Experiment] The 13.9X reduction at 50 heads assumes the same shared codebook generalizes without per-task adaptation; an ablation showing how approximation error grows with the number of distinct coefficient distributions would directly test whether the method remains viable beyond the two-task regime reported.

Authors: We acknowledge that a systematic ablation of approximation error versus number of distinct tasks would provide stronger evidence of scalability. Performing a full multi-task ablation with 50 independent coefficient distributions is computationally expensive and was outside the scope of the original experiments. In the revision we will add a brief discussion of the error growth observed when the shared codebook is applied to the 50-head setting and will explicitly list a controlled multi-distribution ablation as future work.
Revision: partial
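The multi-seed reporting promised in the second response amounts to rerunning a seeded step and summarizing the metric as mean and sample standard deviation. The sketch below shows that bookkeeping only. It uses a stand-in metric (quantization error against randomly seeded centroids on synthetic data) in place of mAP, and every name and size in it is hypothetical.

```python
import numpy as np

def codebook_mse(data, k, seed):
    """Stand-in metric: mean squared distance to k randomly seeded centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

def summarize_over_seeds(metric_fn, seeds):
    """Run a seeded metric once per seed; return mean, sample std, and raw values."""
    values = np.array([metric_fn(s) for s in seeds])
    return float(values.mean()), float(values.std(ddof=1)), values

# Synthetic coefficients standing in for a real layer.
data = np.random.default_rng(123).normal(size=(1000, 13))

mean, std, values = summarize_over_seeds(
    lambda s: codebook_mse(data, k=32, seed=s), range(10)
)
print(f"{mean:.3f} ± {std:.3f} over {len(values)} seeds")
```

Using `ddof=1` gives the sample standard deviation, which is the usual choice when ten runs are treated as a sample from all possible initializations.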

Circularity Check

0 steps flagged

Post-training VQ procedure is self-contained with no circular reductions

The paper presents SHARe-KAN as a post-training compiler that applies Gain-Shape-Bias decomposition and k-means clustering to obtain a layer-shared codebook from pre-trained spline coefficients, followed by Int8 quantization. Reported metrics (9.3X compression, 80.22% mAP on VOC, 88.9% retention on COCO) are obtained by direct measurement on held-out data with no retraining. No equations in the manuscript reduce the compression ratio or accuracy figures to a fitted parameter or self-defined quantity inside the paper. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core pipeline; the method is an empirical procedure whose outputs are independently verifiable against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach relies on the empirical effectiveness of vector quantization for spline coefficients and the feasibility of mapping a shared codebook into L2 cache; no new physical or mathematical entities are postulated.

free parameters (2)

codebook size: The number of entries in the shared codebook is chosen to achieve the reported compression; its exact value is not stated in the abstract.
bit width: Int8 quantization is applied after clustering; the choice of 8 bits is a design parameter.
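To make the bit-width parameter concrete, here is a minimal sketch of symmetric per-tensor Int8 quantization applied to a float codebook. The page does not say which Int8 scheme the paper uses, so the symmetric single-scale choice, the function names, and the 256 × 13 codebook shape are all assumptions.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: one scale, values mapped to [-127, 127]."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale reproduces it exactly
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map Int8 codes back to float32 values."""
    return q.astype(np.float32) * scale

# Hypothetical float32 codebook: 256 entries of 13 coefficients each.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 13)).astype(np.float32)

q, scale = quantize_int8(codebook)
restored = dequantize(q, scale)
max_err = float(np.abs(codebook - restored).max())

print(f"bytes: {codebook.nbytes} -> {q.nbytes}, max abs error: {max_err:.5f}")
```

With round-to-nearest, the worst-case error per value is half a quantization step (`scale / 2`), and storage drops by 4x relative to float32. A per-row or per-layer scale would tighten the error at the cost of storing more scales.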

axioms (1)

domain assumption: Vector quantization of shape vectors preserves sufficient information for downstream task accuracy.
Invoked when the authors state that most of the accuracy gap comes from the VQ step itself.

pith-pipeline@v0.9.0 · 5600 in / 1499 out tokens · 22842 ms · 2026-05-16T22:53:46.481962+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, pages 578–594, 2018.

[2] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.

[3] Wangxuan Fan, Ching Wang, Siqi Li, and Nan Liu. Shift-invariant attribute scoring for Kolmogorov-Arnold networks via Shapley value. arXiv preprint arXiv:2510.01663, 2025.

[4] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.

[5] Allen Gersho and Robert M Gray. Vector quantization and signal compression. Kluwer Academic Publishers, 1992.

[6] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

[7] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[9] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, pages 784–800, 2018.

[10] Mark Horowitz. 1.1 Computing’s energy problem (and what we can do about it). In IEEE International Solid-State Circuits Conference, pages 10–14, 2014.

[11] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, pages 2704–2713, 2018.

[12] Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, and Alan Yuille. The universal weight subspace hypothesis. arXiv preprint arXiv:2512.05117, 2025.

[13] Pengqi Li, Lizhong Ding, Jiarun Fu, Chunhui Zhang, Ye Yuan, and Guoren Wang. Generalization bounds for Kolmogorov-Arnold networks (KANs) and enhanced KANs with lower Lipschitz complexity. In Advances in Neural Information Processing Systems, 2025.

[14] Liquid AI. LFM2 Technical Report. arXiv preprint arXiv:2511.23404, 2025.

[15] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, pages 2736–2744, 2017.

[16] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2025.

[17] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, pages 405–421, 2020.

[18] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. CVPR, 2019.

[19] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.

[20] Tony A Plate. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623–641, 1995.

[21] Demetri Psaltis and Geoffrey W Burr. Holographic storage. Computer, 31(2):52–60, 1998.

[22] PyTorch Team. ExecuTorch: Enabling on-device AI across mobile and embedded devices. https://pytorch.org/executorch, 2024.

[23] Matthew Raffel, Abhijith Renjith, and Lizhong Chen. MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network. arXiv preprint arXiv:2510.19105, 2025.

[24] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In NeurIPS, volume 33, pages 7462–7473, 2020.

[25] Pierre Stock, Armand Joulin, Rémi Gribonval, Benjamin Graham, and Hervé Jégou. And the bit goes down: Revisiting the quantization of neural networks. In ICLR, 2020.

[26] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.

[27] Hoang-Thang Ta, Duy-Quy Thai, et al. PRKAN: Parameter-Reduced Kolmogorov-Arnold Networks. arXiv preprint arXiv:2501.07032, 2025.

[28] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In CVPR, pages 2820–2828, 2019.

A Additional Experimental Details

A.1 Hyperparameters

All KAN models use the following configuration:
• Spline basis: Cubic B-splines (k = 3)
• Grid size: G = 10 point...