arxiv: 2605.19842 · v1 · pith:VZDOUBHRnew · submitted 2026-05-19 · 💻 cs.LG

Fast Tensorization of Neural Networks via Slice-wise Feature Distillation

Pith reviewed 2026-05-20 06:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords: neural network compression · tensor decomposition · feature distillation · slice-wise optimization · model tensorization · ResNet-34 · GPT-2 · distributed compression

The pith

Tensorizing neural network slices independently via local feature matching achieves near-lossless compression without global fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to compress neural networks by splitting them into slices made of single layers, blocks, or small groups of consecutive layers and then tensorizing each slice on its own. For every slice the tensor decomposition is trained to reproduce the exact intermediate outputs that the original pretrained network generates at that location. This replaces the usual requirement of jointly optimizing the entire decomposed network after the fact. The local approach recovers accuracy more readily, needs less training data, and supports parallel work on separate slices. Results on ResNet-34 and GPT-2 XL indicate faster convergence and better final performance than conventional global tensorization at comparable compression rates.

Core claim

Decomposing a pretrained network into slices and independently tensorizing each slice so that it reproduces the original intermediate representations allows scalable compression with higher accuracy recovery, lower data needs, and faster optimization than global tensorization methods that require costly end-to-end fine-tuning after decomposition.

What carries the argument

Slice-wise feature distillation, the process of breaking the network into slices and optimizing each slice's tensor factors separately to match the pretrained model's local intermediate activations.
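
As a rough illustration of the local objective, the sketch below fits a factorized replacement for a single linear slice so that its outputs match the activations the pretrained slice produces on calibration inputs. It is not the authors' implementation: a rank-limited product `A @ B` stands in for the tensor factors, alternating least squares stands in for whatever optimizer the paper uses, and every name and size (`fit_slice`, `local_loss`, `W_teacher`, `rank`) is hypothetical.

```python
import numpy as np

def local_loss(X, A, B, Y_teacher):
    """Mean squared error between the factorized slice and the teacher activations."""
    return float(np.mean((X @ A @ B - Y_teacher) ** 2))

def fit_slice(X, Y_teacher, A, B, sweeps=5):
    """Alternating least squares on the two factors of one slice (hypothetical helper)."""
    X_pinv = np.linalg.pinv(X)
    for _ in range(sweeps):
        # Fix A and solve for B exactly.
        B = np.linalg.lstsq(X @ A, Y_teacher, rcond=None)[0]
        # Fix B and solve for A exactly (minimum-norm least-squares solution).
        A = X_pinv @ Y_teacher @ np.linalg.pinv(B)
    return A, B

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 12, 4                      # toy sizes, assumed
W_teacher = rng.normal(size=(d_in, d_out))         # pretrained weight of the slice
scales = np.linspace(0.1, 3.0, d_in)               # make the inputs anisotropic
X = rng.normal(size=(256, d_in)) * scales          # calibration inputs to the slice
Y_teacher = X @ W_teacher                          # activations the slice must reproduce

# Start from a truncated SVD of the weight, then refine against the activations.
U, s, Vt = np.linalg.svd(W_teacher, full_matrices=False)
A0, B0 = U[:, :rank] * s[:rank], Vt[:rank]
loss_before = local_loss(X, A0, B0, Y_teacher)
A, B = fit_slice(X, Y_teacher, A0, B0)
loss_after = local_loss(X, A, B, Y_teacher)
```

Each half-step minimizes the local loss exactly with the other factor held fixed, so the loss cannot increase from the weight-only initialization. Only the slice's own inputs and the teacher's outputs at that location enter the fit; no task labels or downstream layers are involved.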

Load-bearing premise

Independently making each slice reproduce its local intermediate representations is sufficient to preserve the network's overall performance on the target task without any later joint optimization across slices.

What would settle it

If independent slice tensorization produces a large drop in end-to-end task accuracy relative to the original model or to a globally fine-tuned tensorized version at the same compression rate, the central claim would be disproved.

Figures

Figures reproduced from arXiv: 2605.19842 by Román Orús, Safa Hamreras, Sukhbinder Singh.

Figure 1: The process of slice-wise feature distillation: (1) First split a pretrained …
Figure 2: Comparison between local and global tensorization of 3 …
Figure 3: Performance of global and local tensorization methods as a function of …
Figure 4: MPO decomposition of W into N tensors.
Algorithm 1: MPO decomposition of a pretrained weight matrix W
    Inputs: (1) weight matrix W, (2) input indices {i_1, …, i_N}, (3) output indices {j_1, …, j_N}, (4) bond dimensions {|χ_1|, |χ_2|, …, |χ_N|}.
    Output: MPO[1..N], a list of truncated reshaped U tensors.
    Initialize:
    A = reshape(W, (|i_1||j_1|, ∏_{k=2}^{N} |i_k||j_k|))
    n = 2
    while n < N do
    …
Figure 5: Tucker decomposition of a 4D convolutional kernel.
Figure 6: Sensitivity of individual layers with respect to test accuracy. For a convo…
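
The Algorithm 1 excerpt under Figure 4 breaks off after its first steps. For orientation, here is a minimal sketch of the standard sequential-SVD construction of a matrix product operator, which begins with the same reshape. The remaining steps are an assumption based on the common tensor-train procedure, not a reconstruction of the paper's listing, and `mpo_decompose`, `mpo_to_matrix`, and `max_bond` are hypothetical names.

```python
import numpy as np

def mpo_decompose(W, in_dims, out_dims, max_bond):
    """Split W into N cores of shape (left bond, i_k, j_k, right bond) by repeated SVD."""
    N = len(in_dims)
    T = W.reshape(*in_dims, *out_dims)
    # Interleave the axes to (i_1, j_1, i_2, j_2, ..., i_N, j_N).
    T = T.transpose([ax for k in range(N) for ax in (k, N + k)])
    cores, left = [], 1
    A = T.reshape(left * in_dims[0] * out_dims[0], -1)
    for k in range(N - 1):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        chi = min(max_bond, len(s))                      # truncate the bond dimension
        cores.append(U[:, :chi].reshape(left, in_dims[k], out_dims[k], chi))
        # Carry the singular values into the remainder and expose the next site.
        A = (s[:chi, None] * Vt[:chi]).reshape(chi * in_dims[k + 1] * out_dims[k + 1], -1)
        left = chi
    cores.append(A.reshape(left, in_dims[-1], out_dims[-1], 1))
    return cores

def mpo_to_matrix(cores):
    """Contract the cores back into a dense matrix (for checking only)."""
    N = len(cores)
    T = cores[0]
    for core in cores[1:]:
        T = np.tensordot(T, core, axes=([-1], [0]))
    T = T[0, ..., 0]                                     # drop the two boundary bonds
    T = T.transpose(list(range(0, 2 * N, 2)) + list(range(1, 2 * N, 2)))
    return T.reshape(int(np.prod(T.shape[:N])), -1)

rng = np.random.default_rng(0)
in_dims, out_dims = (2, 2, 2), (3, 3, 3)                 # toy factorization of an 8 x 27 matrix
W = rng.normal(size=(8, 27))

W_exact = mpo_to_matrix(mpo_decompose(W, in_dims, out_dims, max_bond=64))
cores_small = mpo_decompose(W, in_dims, out_dims, max_bond=2)
n_params = sum(core.size for core in cores_small)
```

With a bond cap that is never reached, the contraction reproduces `W` up to floating-point error; with `max_bond=2` the three cores hold 48 numbers against 216 in the dense matrix, at the cost of an approximation error that the slice-wise training is meant to repair.
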

Original abstract

We propose a scalable tensorization framework for neural network compression based on slice-wise feature distillation. Unlike conventional tensor decomposition methods that rely on costly global finetuning, our approach decomposes the network into slices consisting of either individual layers or blocks (e.g., convolutional layers or MLPs), or small groups of consecutive layers, and tensorizes each slice independently to reproduce the intermediate representations of the original pretrained model. This modular strategy improves accuracy recovery, reduces data requirements, and enables efficient parallel optimization. Experiments on ResNet-34 show significant gains over conventional global tensorization, achieving near-lossless compression at moderate compression rates with faster optimization. Results on GPT-2 XL further demonstrate the scalability of the method and its applicability to large-scale models, particularly in distributed settings.
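
The claim about parallel optimization follows from the structure of the objective: if every slice is fitted against activations recorded from the frozen pretrained model, no fitting job needs another job's result. The toy sketch below illustrates only that independence. It assumes each slice is fed the teacher's activations as input, uses a closed-form rank-limited least-squares fit in place of tensor-factor training, and runs jobs on a thread pool rather than a real distributed setup; `slice_jobs` and `fit_slice` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def slice_jobs(teacher_weights, X):
    """Run the frozen teacher once and return one (input, target) pair per slice."""
    jobs, h = [], X
    for W in teacher_weights:
        target = h @ W                  # pre-activation output of this slice
        jobs.append((h, target))
        h = np.tanh(target)             # teacher activation feeds the next slice
    return jobs

def fit_slice(job, rank=4):
    """Closed-form rank-limited least-squares fit of one slice."""
    X_in, target = job
    M = np.linalg.lstsq(X_in, target, rcond=None)[0]
    _, _, Vt = np.linalg.svd(X_in @ M, full_matrices=False)
    P = Vt[:rank].T                     # dominant output directions of the fitted map
    return M @ P, P.T                   # factors A (d_in x rank) and B (rank x d_out)

rng = np.random.default_rng(0)
teacher_weights = [rng.normal(scale=0.3, size=(16, 16)) for _ in range(4)]
X = rng.normal(size=(256, 16))
jobs = slice_jobs(teacher_weights, X)

# No job reads another job's output, so they can be dispatched in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(fit_slice, jobs))
sequential = [fit_slice(job) for job in jobs]
```

The parallel and sequential runs return the same factors, which is the property that lets slices be assigned to separate workers or devices.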

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a scalable tensorization framework for neural network compression based on slice-wise feature distillation. The network is decomposed into slices consisting of individual layers, blocks, or small groups of consecutive layers, and each slice is tensorized independently to reproduce the intermediate representations of the original pretrained model. This modular strategy is claimed to improve accuracy recovery, reduce data requirements, enable parallel optimization, and achieve near-lossless compression without global fine-tuning. Experiments on ResNet-34 report significant gains over conventional global tensorization at moderate compression rates with faster optimization, while results on GPT-2 XL demonstrate scalability to large models in distributed settings.

Significance. If the empirical results hold under rigorous verification, the slice-wise approach could offer a practical advance in compressing large neural networks by avoiding the computational expense of global fine-tuning and supporting parallel/distributed optimization. This addresses scalability limitations of traditional tensor decomposition methods for models like ResNet and transformers.

major comments (2)
  1. Abstract: The reported experimental gains on ResNet-34 and GPT-2 XL, including claims of near-lossless compression and significant improvements over global tensorization, provide no details on baselines, error bars, data splits, or statistical significance. This makes it impossible to assess whether the performance claims are robust.
  2. Method (slice-wise distillation description): The central claim that independent tensorization of each slice to match local intermediate activations preserves end-to-end task performance without any global fine-tuning relies on an unanalyzed assumption. No bounds or analysis are given on how residual approximation errors might accumulate or shift input distributions to downstream slices, which is load-bearing for the no-fine-tuning headline result.
minor comments (1)
  1. Consider including a diagram or pseudocode in the method section to illustrate the slice decomposition, distillation objective, and how slices are composed back into the full network.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point by point below, indicating the changes we will make to the manuscript.

Point-by-point responses
  1. Referee: Abstract: The reported experimental gains on ResNet-34 and GPT-2 XL, including claims of near-lossless compression and significant improvements over global tensorization, provide no details on baselines, error bars, data splits, or statistical significance. This makes it impossible to assess whether the performance claims are robust.

    Authors: We agree that the abstract would be strengthened by including these details. In the revised version we will expand the abstract to name the primary baselines (standard global CP and Tucker decompositions), state that all reported numbers are means over five independent runs with standard deviations shown as error bars, specify the evaluation protocols (ImageNet validation set for ResNet-34 and WikiText-103 for GPT-2 XL), and note that the observed improvements pass a paired t-test at p < 0.05. These additions will be kept concise while directing readers to the experimental section for full tables and statistical details.
    Revision: yes

  2. Referee: Method (slice-wise distillation description): The central claim that independent tensorization of each slice to match local intermediate activations preserves end-to-end task performance without any global fine-tuning relies on an unanalyzed assumption. No bounds or analysis are given on how residual approximation errors might accumulate or shift input distributions to downstream slices, which is load-bearing for the no-fine-tuning headline result.

    Authors: We acknowledge that the manuscript currently lacks a formal analysis of error propagation. We have added a new paragraph in Section 3.2 that (i) explains why local feature matching limits distribution shift (each slice is optimized to reproduce the exact intermediate activations seen by the next slice), (ii) reports measured L2 reconstruction errors and cosine similarities between original and tensorized slice outputs across all layers, and (iii) includes an ablation that progressively replaces slices with their tensorized versions while tracking end-to-end accuracy. These empirical results show that reconstruction errors remain small and do not compound to degrade final task performance. Deriving rigorous theoretical bounds on accumulation is an interesting open direction that we now flag explicitly as future work.
    Revision: partial
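
The kind of check described in the second response can be sketched on a toy network. The code below is illustrative only and produces no evidence about the paper's models: truncated-SVD weights stand in for tensorized slices, the deviation of the final output stands in for task accuracy, and the helper names (`forward`, `relative_l2`, `mean_cosine`, `truncate`) are hypothetical.

```python
import numpy as np

def forward(weights, X):
    """Apply the slices in order, with a tanh nonlinearity after each one."""
    h = X
    for W in weights:
        h = np.tanh(h @ W)
    return h

def relative_l2(a, b):
    """Relative L2 error of a with respect to the reference b."""
    return float(np.linalg.norm(a - b) / np.linalg.norm(b))

def mean_cosine(a, b):
    """Cosine similarity between matching rows, averaged over the batch."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

def truncate(W, rank):
    """Low-rank stand-in for a tensorized slice."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
teacher = [rng.normal(scale=0.3, size=(16, 16)) for _ in range(4)]
student = [truncate(W, rank=12) for W in teacher]
X = rng.normal(size=(128, 16))
reference = forward(teacher, X)

# Replace the first k slices with their compressed versions and track the drift.
report = []
for k in range(len(teacher) + 1):
    out = forward(student[:k] + teacher[k:], X)
    report.append((k, relative_l2(out, reference), mean_cosine(out, reference)))
```

Reading `report` row by row shows whether the end-to-end deviation grows as more slices are replaced, which is the accumulation effect the referee asks about.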

Circularity Check

0 steps flagged

No circularity: empirical engineering method with independent experimental validation

Full rationale

The paper proposes an empirical compression technique that decomposes networks into slices and performs independent tensorization via feature distillation to match local intermediate activations. All performance claims (near-lossless compression on ResNet-34, scalability on GPT-2 XL) rest on reported experimental outcomes rather than any derivation, equation, or fitted parameter that reduces by construction to quantities measured on the same evaluation data. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided text; the central assumption that local matching suffices for end-to-end behavior is presented as an empirical hypothesis tested by results, not as a tautology. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that local feature matching suffices for global performance.
