Robust Basis Spline Decoupling for the Compression of Transformer Models

Joppe De Jonghe; Mariya Ishteva; Van Tien Pham

arxiv: 2605.18794 · v1 · pith:SZ2SGMQMnew · submitted 2026-05-11 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-20 21:55 UTC · model grok-4.3


keywords: B-spline decoupling · transformer compression · coupled matrix-tensor factorization · model compression · neural network approximation · R-CMTF-BSD · tensor-based decoupling


The pith

B-spline decoupling lets transformers keep competitive accuracy after large parameter cuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a decoupling approach that represents the multivariate mappings inside transformer layers as linear transformations followed by univariate nonlinear functions. It replaces the polynomial or piecewise-linear parameterizations used in earlier work with B-splines, whose local support and tunable smoothness improve numerical stability and expressiveness. The representation is recast as a constrained coupled matrix-tensor factorization problem and solved with R-CMTF-BSD, an alternating least-squares algorithm that adds normalization and Tikhonov regularization. In experiments on synthetic data and on Vision and Swin Transformer models, the resulting approximations substantially reduce the total parameter count while accuracy stays close to that of the original models. This structured compression route is presented as a practical way to lighten large neural networks without redesigning their overall architecture.
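As a reading aid, the decoupled form described in this summary can be written compactly. The notation is borrowed from common usage in the decoupling literature and is an assumption, not the paper's own: $\mathbf{V}$ and $\mathbf{W}$ are the input and output linear maps, $r$ is the number of univariate branches, and $c_{ij}$ are the coefficients of branch $i$ on a degree-$d$ B-spline basis $B_{j,d}$ with $K$ basis functions.

```latex
\begin{align*}
\mathbf{f}(\mathbf{x}) &\approx \mathbf{W}\,\mathbf{g}(\mathbf{V}^{\top}\mathbf{x}),
  & \mathbf{V} &\in \mathbb{R}^{m \times r},\ \mathbf{W} \in \mathbb{R}^{n \times r}, \\
\mathbf{g}(\mathbf{z}) &= \begin{bmatrix} g_1(z_1) & \cdots & g_r(z_r) \end{bmatrix}^{\top},
  & g_i(z) &= \sum_{j=1}^{K} c_{ij}\, B_{j,d}(z).
\end{align*}
```

Under this reading, the learnable quantities are $\mathbf{V}$, $\mathbf{W}$, and the $r \times K$ array of spline coefficients, which is where any parameter saving would come from.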

Core claim

B-spline decoupling generalizes existing tensor-based decoupling methods by using basis splines to parameterize the internal univariate functions, yielding a constrained coupled matrix-tensor factorization that is solved stably by the R-CMTF-BSD algorithm; when applied to real transformer weight tensors the factorization produces compressed models whose accuracy on Vision and Swin architectures stays competitive with the uncompressed originals.
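One way to see where a coupled matrix-tensor factorization can come from is the first-order construction used in earlier tensor-based decoupling work. The sketch below assumes a decoupled model $\mathbf{f}(\mathbf{x}) \approx \mathbf{W}\,\mathbf{g}(\mathbf{V}^{\top}\mathbf{x})$ with $r$ branches, sampled at $N$ points $\mathbf{x}^{(k)}$. All symbols, including the weight $\mu$ and the coefficient vector $\mathbf{c}$, are illustrative assumptions; the paper's exact objective and constraints may differ.

```latex
\begin{align*}
\mathbf{J}_{\mathbf{f}}(\mathbf{x}^{(k)}) &\approx
  \mathbf{W}\,\operatorname{diag}\!\big(g_1'(\mathbf{v}_1^{\top}\mathbf{x}^{(k)}),\dots,
  g_r'(\mathbf{v}_r^{\top}\mathbf{x}^{(k)})\big)\,\mathbf{V}^{\top},
  \qquad k = 1,\dots,N, \\
\mathcal{J} &\approx [\![\mathbf{W},\mathbf{V},\mathbf{H}]\!],
  \qquad H_{ki} = g_i'(\mathbf{v}_i^{\top}\mathbf{x}^{(k)}), \\
\mathbf{F} &\approx \mathbf{W}\mathbf{G},
  \qquad G_{ik} = g_i(\mathbf{v}_i^{\top}\mathbf{x}^{(k)}), \\
\min_{\mathbf{W},\mathbf{V},\mathbf{c}}\;&
  \big\|\mathcal{J} - [\![\mathbf{W},\mathbf{V},\mathbf{H}(\mathbf{c})]\!]\big\|_F^2
  + \mu\,\big\|\mathbf{F} - \mathbf{W}\mathbf{G}(\mathbf{c})\big\|_F^2 .
\end{align*}
```

The first line is the chain rule applied to the decoupled model. Stacking the $N$ Jacobians gives a third-order tensor with a rank-$r$ canonical polyadic structure, and the function values give a matrix that shares the factor $\mathbf{W}$. In this sketch $\mathbf{H}$ and $\mathbf{G}$ are tied together because both are generated by the same spline coefficients $\mathbf{c}$, which is what makes the factorization constrained.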

What carries the argument

B-spline decoupling realized through constrained coupled matrix-tensor factorization, with R-CMTF-BSD performing the alternating least-squares updates under normalization and Tikhonov regularization.
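In an alternating least-squares scheme, each factor update is a linear least-squares problem, and Tikhonov regularization adds a penalty on the size of the solution. The following is a minimal pure-Python sketch of one such regularized solve, not the R-CMTF-BSD algorithm itself: the helper `ridge_solve` and the toy data are hypothetical, and the two values of `lam` are simply the λ values that appear in the Figure 4 caption.

```python
def ridge_solve(A, b, lam):
    """Solve min_x ||A x - b||^2 + lam * ||x||^2 via the normal equations.

    A is a list of rows, b a list of floats. Illustrative helper only;
    it is not taken from the paper.
    """
    n = len(A[0])
    # Build the augmented system [A^T A + lam * I | A^T b].
    M = [[sum(row[i] * row[j] for row in A) + (lam if i == j else 0.0)
          for j in range(n)]
         + [sum(row[i] * bi for row, bi in zip(A, b))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-14:
            raise ValueError("singular system; increase lam")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return sum(t * t for t in v) ** 0.5


# Two nearly collinear columns make the unregularized problem ill-conditioned.
A = [[1.0, 1.0], [1.0, 1.0001], [1.0, 0.9999]]
b = [2.0, 2.0, 2.0]
x_small = ridge_solve(A, b, 1e-6)   # weak regularization
x_large = ridge_solve(A, b, 0.25)   # strong regularization
print(norm(x_small), norm(x_large))
```

A larger `lam` shrinks the solution and improves the conditioning of the system being solved, at the cost of some bias in the fit. How the paper combines this with its normalization step is not specified in this summary.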

If this is right

Vision and Swin Transformer models can be stored and run with substantially fewer parameters.
The same B-spline representation remains stable when the underlying weight tensors contain the typical range of values found in trained transformers.
Decoupling supplies a direct structural link between a single-hidden-layer network and tensor factorization, allowing compression without changing the outer network topology.
The method extends earlier polynomial and piecewise-linear decoupling techniques while inheriting their link to fully connected layers with flexible activations.
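To make the parameter argument concrete, the sketch below counts parameters for a dense two-layer block and for a single decoupled map that might replace it. Everything in it is an assumption for illustration: the function names, the choice to replace a whole two-layer block, the omission of biases in the decoupled map, and the sizes (a 768 → 3072 → 768 block, 64 branches, 10 spline coefficients per branch). None of the numbers are results from the paper.

```python
def mlp_params(d_model, d_hidden):
    """Weights and biases of a two-layer block d_model -> d_hidden -> d_model."""
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model


def decoupled_params(n_in, n_out, branches, coeffs_per_branch):
    """Parameters of W g(V^T x): V (n_in x r), W (n_out x r), r coefficient vectors."""
    return n_in * branches + n_out * branches + branches * coeffs_per_branch


def saving(original, compressed):
    """Fraction of parameters removed by the replacement."""
    return 1.0 - compressed / original


# Hypothetical sizes, chosen only to illustrate the arithmetic.
original = mlp_params(768, 3072)
compressed = decoupled_params(768, 768, 64, 10)
print(original, compressed, f"{100 * saving(original, compressed):.1f}% saved")
```

The count scales linearly in the number of branches, so the achievable saving depends mainly on how few branches still give an acceptable approximation.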

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same factorization could be applied to attention or feed-forward blocks in language transformers to test whether the compression gains transfer beyond vision models.
Because B-splines allow explicit control of smoothness order, the approach might be tuned to preserve low-frequency weight patterns that matter most for generalization.
If the constrained factorization proves robust, it could serve as a drop-in module inside existing model-pruning pipelines that already rely on tensor decompositions.

Load-bearing premise

The local support and smoothness properties of B-splines, together with the constrained factorization and regularized solver, are assumed to produce approximations that transfer from synthetic data to actual transformer weights without large accuracy loss.

What would settle it

Running R-CMTF-BSD on the weights of a pretrained Vision Transformer and observing either numerical instability during factorization or an accuracy drop larger than a few percent after the claimed parameter reduction would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2605.18794 by Joppe De Jonghe, Mariya Ishteva, Van Tien Pham.

**Figure 2.** General transformer encoder block with attention mechanism followed by channel …

**Figure 3.** Approximations of the component functions

**Figure 4.** Mean top-1 accuracy results for the (top) CMTF-BSD and CMTF-PD algorithms and (bottom) R-CMTF-BSD and R-CMTF-PD algorithms. Each result is over 5 runs for different hyperparameter configurations. Results in row one have λ set to 1e−6, in row two λ is set to 0.25. The first column contains (R-)CMTF-PD results with polynomials of degree d ∈ {3, 4, 5, 6} (y-axis), columns two, three and four contain (R-)CMT…

**Figure 5.** Mean Error(J) results for the R-CMTF-BSD and R-CMTF-PD algorithms over 5 runs for different hyperparameter configurations. The figure layout and axis are as mentioned in the caption of …

**Figure 6.** Mean Error(F) results for the R-CMTF-BSD and R-CMTF-PD algorithms over 5 runs for different hyperparameter configurations. The figure layout and axis are as mentioned in the caption of …

**Figure 7.** Saving percentage values for the compressed FCNN and used polynomial (left) and …

**Figure 8.** Mean top-1 accuracy results over 3 runs of the BF and FB compression procedures, discussed in Subsection 6.3, applied to a ViT model using the R-CMTF-PD and R-CMTF-BSD algorithm. Each row corresponds to results on a different dataset, denoted by the row title. The first column corresponds to polynomial activations of degree 3, the remaining columns correspond to the used B-spline activations of degree 1, 2…

**Figure 9.** Mean top-1 accuracy results over 3 runs of the BF and FB compression procedures, discussed in Subsection 6.3, applied to a Swin model using the R-CMTF-PD and R-CMTF-BSD algorithm. The figure layout and axis are as described in the caption of …

Original abstract

Decoupling is a powerful modeling paradigm for representing multivariate functions as compositions of linear transformations and univariate nonlinear functions. A single-layer decoupling can be viewed as a fully connected neural network with a single hidden layer and flexible activation functions, providing a direct link with neural networks. Because of this, the use of decoupling methods has gained increasing attention in neural network domains, particularly compression, since it enables structured approximations with reduced parameter complexity. Existing tensor-based decoupling methods typically rely on polynomial or piecewise-linear parameterizations of the internal nonlinear functions, which can suffer from numerical instability or limited expressiveness. In this work, we introduce a B-spline-based decoupling framework that generalizes these existing approaches. By exploiting the local support and flexible smoothness control of B-splines, the proposed formulation yields a more numerically stable and expressive representation. We derive a constrained coupled matrix-tensor factorization and propose a robust alternating least-squares algorithm, called R-CMTF-BSD, incorporating normalization and Tikhonov regularization. The proposed method is validated through experiments on synthetic data and transformer model compression. Results on the Vision and Swin Transformer architectures demonstrate that B-spline decoupling enables substantial parameter reduction while maintaining competitive accuracy, making the R-CMTF-BSD algorithm a promising tool for structured neural network compression.
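The local support that the abstract refers to can be seen directly from the standard Cox-de Boor recursion for B-spline basis functions. The sketch below is a plain-Python illustration, not code from the paper; the function names, the clamped cubic knot vector, and the evaluation point 0.3 are all assumptions chosen for the example.

```python
def bspline_basis(j, d, knots, z):
    """Value of the j-th degree-d B-spline basis function at z (Cox-de Boor)."""
    if d == 0:
        # Degree 0: indicator of the half-open knot span [t_j, t_{j+1}).
        return 1.0 if knots[j] <= z < knots[j + 1] else 0.0
    left = right = 0.0
    span_left = knots[j + d] - knots[j]
    span_right = knots[j + d + 1] - knots[j + 1]
    # Terms with a zero-width span are dropped (the usual 0/0 := 0 convention).
    if span_left > 0:
        left = (z - knots[j]) / span_left * bspline_basis(j, d - 1, knots, z)
    if span_right > 0:
        right = ((knots[j + d + 1] - z) / span_right
                 * bspline_basis(j + 1, d - 1, knots, z))
    return left + right


def spline(coeffs, d, knots, z):
    """Univariate function g(z) = sum_j coeffs[j] * B_{j,d}(z)."""
    return sum(c * bspline_basis(j, d, knots, z) for j, c in enumerate(coeffs))


# Hypothetical clamped cubic knot vector on [0, 1] with three interior knots.
knots = [0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1]
degree = 3
n_basis = len(knots) - degree - 1  # 7 basis functions

values = [bspline_basis(j, degree, knots, 0.3) for j in range(n_basis)]
active = [j for j, v in enumerate(values) if v > 0]
print(active, sum(values))
```

At any point inside the domain only `degree + 1` basis functions are nonzero and they sum to one, so changing one coefficient affects the function only locally. With degree 1 the basis functions are hat functions, which gives a piecewise-linear parameterization, and with no interior knots the spline space reduces to ordinary polynomials of that degree on the interval, consistent with the claim that B-splines generalize both earlier choices. Because the degree-0 case uses half-open spans, this simple version returns zero at the right endpoint `z = 1`.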

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a B-spline parameterization for the univariate nonlinear functions within a decoupling model, which is linked to single-hidden-layer networks. It derives a constrained coupled matrix-tensor factorization (CMTF) and develops the R-CMTF-BSD alternating least-squares algorithm that incorporates normalization and Tikhonov regularization. The approach is positioned as more stable and expressive than polynomial or piecewise-linear alternatives. Validation is reported on synthetic data together with compression experiments on Vision and Swin Transformer architectures, where the method is claimed to achieve substantial parameter reduction while preserving competitive accuracy.

Significance. If the empirical claims hold after the addition of necessary controls, the work would supply a numerically stable tensor-factorization route to structured compression of transformer weights. The exploitation of B-spline local support and smoothness control, together with the robust optimization procedure, addresses documented weaknesses of earlier decoupling formulations and could be useful for parameter-efficient model deployment.

major comments (2)

[Experimental validation] Experimental validation section: no head-to-head comparison is presented between B-spline decoupling and the polynomial or piecewise-linear baselines on identical Vision or Swin Transformer layers. Without these controls it is impossible to determine whether the reported competitive accuracy stems from the B-spline choice or from the R-CMTF-BSD optimizer itself.
[Method] R-CMTF-BSD algorithm description: the manuscript provides no ablation or sensitivity study on the regularization parameter, knot placement, or spline order. Given that the central claim rests on the stability and expressiveness gained from these B-spline properties, the absence of such analysis leaves the contribution of the new parameterization unquantified.

minor comments (2)

The abstract would be strengthened by the inclusion of at least one concrete quantitative result (e.g., parameter reduction percentage and top-1 accuracy delta) rather than the qualitative statement 'substantial parameter reduction while maintaining competitive accuracy.'
Notation for the constrained coupled factorization and the normalization step inside R-CMTF-BSD should be introduced with an explicit equation reference to improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and indicate the planned revisions to the manuscript.


Referee: [Experimental validation] Experimental validation section: no head-to-head comparison is presented between B-spline decoupling and the polynomial or piecewise-linear baselines on identical Vision or Swin Transformer layers. Without these controls it is impossible to determine whether the reported competitive accuracy stems from the B-spline choice or from the R-CMTF-BSD optimizer itself.

Authors: We agree that head-to-head comparisons on identical transformer layers would more clearly isolate the contribution of the B-spline parameterization versus the optimizer. The current manuscript demonstrates B-spline advantages primarily through synthetic experiments that compare stability and expressiveness against polynomial and piecewise-linear alternatives, while the transformer results focus on overall compression performance relative to the uncompressed models. To address this gap, we will add direct comparisons of B-spline, polynomial, and piecewise-linear decoupling on selected layers from the Vision and Swin Transformers, all optimized with the R-CMTF-BSD procedure. These results will be reported in the revised experimental section. revision: yes
Referee: [Method] R-CMTF-BSD algorithm description: the manuscript provides no ablation or sensitivity study on the regularization parameter, knot placement, or spline order. Given that the central claim rests on the stability and expressiveness gained from these B-spline properties, the absence of such analysis leaves the contribution of the new parameterization unquantified.

Authors: We concur that systematic sensitivity analyses would better quantify the role of B-spline hyperparameters. The manuscript specifies the regularization strength, knot counts, and spline orders selected for the reported experiments to achieve numerical stability, but does not present ablations. In the revision we will incorporate sensitivity studies that vary the Tikhonov regularization parameter, knot placement strategies, and spline orders, evaluating their effects on convergence behavior, numerical stability, and final compression accuracy for both the synthetic benchmarks and the transformer models. revision: yes

Circularity Check

0 steps flagged

No circularity: B-spline decoupling and R-CMTF-BSD derived independently of target results


The paper introduces B-spline parameterization to generalize prior polynomial/piecewise-linear decoupling methods, exploits local support and smoothness properties to form a constrained coupled matrix-tensor factorization, and develops the R-CMTF-BSD alternating least-squares procedure with explicit normalization and Tikhonov regularization. These steps are presented as methodological contributions whose parameters are fitted to weight tensors; the subsequent empirical results on synthetic data and Vision/Swin Transformer compression (parameter reduction with competitive accuracy) are downstream validations rather than quantities forced by construction from the inputs. No self-definitional loop, fitted-input-renamed-as-prediction, or load-bearing self-citation chain appears in the derivation; the central claims rest on the new parameterization and optimization rather than re-expressing the same fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper relies on standard properties of B-splines (local support, smoothness control) and established coupled matrix-tensor factorization techniques. No explicit free parameters, axioms, or invented entities are detailed beyond the introduction of the B-spline decoupling framework itself.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We introduce a B-spline-based decoupling framework... constrained coupled matrix-tensor factorization and propose a robust alternating least-squares algorithm, called R-CMTF-BSD

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Lemmas 1-3 establish that polynomial and piecewise-linear bases are strict subspaces of the B-spline space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

