arxiv: 1907.01522 · v1 · pith:MF3OBMYCnew · submitted 2019-06-28 · 📡 eess.SP · cs.AR

Tucker Tensor Decomposition on FPGA

Pith reviewed 2026-05-25 13:01 UTC · model grok-4.3

classification 📡 eess.SP cs.AR
keywords Tucker decomposition · FPGA accelerator · tensor computation · hardware optimization · MRI data processing · fixed-point arithmetic · Jacobi SVD · speedup evaluation

The pith

FPGA hardware accelerator for Tucker decomposition delivers 2.16-30.2x speedup over CPU and GPU on cardiac MRI data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds and evaluates an FPGA implementation of Tucker tensor decomposition to bring high-dimensional tensor methods to resource-limited hardware. It breaks the algorithm into three core modules—tensor-times-matrix multiplication, singular value decomposition via warm-start Jacobi iterations, and tensor permutation—and realizes them in fixed-point arithmetic. Tests on synthetic tensors and a real cardiac MRI dataset show the design runs substantially faster than established software libraries on both CPUs and GPUs. The work therefore claims that dedicated hardware can make classical tensor decompositions practical for embedded or real-time scientific applications.
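
As a rough illustration of how the three modules fit together, the following NumPy sketch runs higher-order orthogonal iteration (HOOI, the Tucker algorithm named in the Figure 9 caption) in floating point. The names `unfold`, `ttm`, and `hooi` are hypothetical, `np.linalg.svd` stands in for the paper's warm-start Jacobi SVD, and nothing here models the FPGA datapath or fixed-point arithmetic.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` unfolding: move that mode to the rows (a permutation), then flatten."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order="F")

def ttm(X, M, mode):
    """Tensor-times-matrix: multiply every mode-`mode` fiber of X by M."""
    Y = np.tensordot(M, X, axes=(1, mode))  # contracted mode lands on axis 0
    return np.moveaxis(Y, 0, mode)          # permute it back into place

def hooi(X, ranks, iters=10):
    """Floating-point HOOI sketch; returns (core, factors)."""
    # Initialise each factor from the leading left singular vectors (HOSVD).
    factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
               for n, r in enumerate(ranks)]
    for _ in range(iters):
        for n in range(X.ndim):
            Y = X
            for m in range(X.ndim):
                if m != n:
                    Y = ttm(Y, factors[m].T, m)   # project all other modes
            U = np.linalg.svd(unfold(Y, n), full_matrices=False)[0]
            factors[n] = U[:, :ranks[n]]
    core = X
    for m in range(X.ndim):
        core = ttm(core, factors[m].T, m)
    return core, factors
```

Each pass alternates TTM, permutation plus unfolding, and an SVD, which is why those three kernels are the natural units to map to hardware.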

Core claim

We present an FPGA-based hardware accelerator for Tucker decomposition that implements TTM, SVD with warm-start Jacobi iterations, and permutation operations in fixed-point. On a cardiac MRI dataset, this achieves 2.16 to 30.2 times speedup over state-of-the-art software toolboxes on CPU and GPU while maintaining useful numerical accuracy.

What carries the argument

FPGA architecture with dedicated modules for tensor-times-matrix multiplication, warm-start Jacobi SVD, and tensor permutation, evaluated through fixed-point simulation.

If this is right

  • Tensor decompositions become feasible inside power- or size-constrained medical imaging devices.
  • Warm-start Jacobi iterations reduce iteration count in hardware SVD, shortening overall runtime.
  • Fixed-point designs lower resource usage on FPGAs compared with floating-point alternatives.
  • The modular breakdown allows reuse of TTM and permutation blocks for related tensor algorithms.
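
The warm-start idea can be sketched as follows. This page does not reproduce the paper's exact scheme, so the version below is an assumption: a one-sided (Hestenes) Jacobi SVD that optionally starts from the right singular vectors of a previous, nearby matrix, as would be available from the preceding outer iteration. The function name `jacobi_svd` and its parameters are hypothetical, and the input is assumed to have full column rank.

```python
import numpy as np

def jacobi_svd(A, V0=None, tol=1e-10, max_sweeps=50):
    """One-sided Jacobi SVD of A (m x n, m >= n, full column rank assumed).

    V0 is an optional orthogonal n x n warm-start guess for the right
    singular vectors. Returns (U, s, V, sweeps) with A ~= U @ diag(s) @ V.T.
    """
    n = A.shape[1]
    V = np.eye(n) if V0 is None else np.array(V0, dtype=float)
    B = A @ V                                   # warm start pre-rotates the columns
    sweeps = 0
    for sweeps in range(1, max_sweeps + 1):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = B[:, p] @ B[:, p]
                beta = B[:, q] @ B[:, q]
                gamma = B[:, p] @ B[:, q]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue                    # this column pair is already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                for M in (B, V):                # apply the same rotation to B and V
                    col_p = M[:, p].copy()
                    M[:, p] = c * col_p - s * M[:, q]
                    M[:, q] = s * col_p + c * M[:, q]
        if not rotated:                         # a full sweep with no rotation: converged
            break
    sing = np.linalg.norm(B, axis=0)
    order = np.argsort(sing)[::-1]
    return B[:, order] / sing[order], sing[order], V[:, order], sweeps
```

Calling `jacobi_svd(A_next, V0=V_prev)` on a matrix close to the previous one starts with nearly orthogonal columns, so fewer sweeps are expected than from a cold start with `V0=None`.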

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fixed-point FPGA blocks could be retargeted to other tensor factorizations such as CP decomposition.
  • Power measurements on the FPGA would likely show lower energy per decomposition than GPU baselines.
  • Scaling the design to larger tensors would depend on memory bandwidth rather than arithmetic throughput.
  • Warm-start techniques may transfer to other iterative linear-algebra kernels in hardware.

Load-bearing premise

Fixed-point arithmetic preserves enough numerical accuracy that the resulting Tucker factors remain useful on real data such as MRI without large degradation relative to floating-point baselines.
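
To make the premise concrete, here is a minimal sketch of signed fixed-point quantization with round-to-nearest and saturation. The 16-bit word with 12 fractional bits is an assumed format chosen for illustration; this page does not state the bit widths used in the paper, and `to_fixed` is a hypothetical helper.

```python
import numpy as np

def to_fixed(x, word_bits=16, frac_bits=12):
    """Round x to a signed fixed-point grid and saturate at the word limits.

    word_bits and frac_bits are illustrative assumptions, not the paper's format.
    """
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))        # most negative integer code
    hi = (1 << (word_bits - 1)) - 1     # most positive integer code
    codes = np.clip(np.round(np.asarray(x, dtype=np.float64) * scale), lo, hi)
    return codes / scale

x = np.linspace(-1.0, 1.0, 1001)
max_err = np.max(np.abs(x - to_fixed(x)))   # bounded by half a step, 2**-13
```

In-range values are off by at most half a quantization step, while out-of-range values saturate, so the premise depends on both the fractional precision and on scaling the data to avoid overflow.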

What would settle it

Run the fixed-point FPGA implementation and a double-precision reference on the same cardiac MRI dataset; if the relative reconstruction error or downstream analysis quality differs by more than a few percent, the reported speedup cannot be counted as a like-for-like gain.

Figures

Figures reproduced from arXiv: 1907.01522 by Kaiqi Zhang, Xiyuan Zhang, Zheng Zhang.

Figure 1: Left to right: a tensor, slices, and fibers.
Figure 2: Tensor unfolding. Tensor permutation: Tensor permutation changes the mode order of a tensor. It is a high-order extension of the matrix transpose. For instance, given $\mathcal{X} \in \mathbb{R}^{5 \times 10 \times 3}$, permute($\mathcal{X}$, [2, 3, 1]) generates a new tensor $\mathcal{Y} \in \mathbb{R}^{10 \times 3 \times 5}$ with $y_{i_2, i_3, i_1} = x_{i_1, i_2, i_3}$. Unfolding: Unfolding (or matricization) as shown in …
Figure 4: Overall structure of our Tucker decomposition.
Figure 5: The TTM unit. The red part is used when computing …
Figure 6: (a) The details of a PE. The red part is used only for …
Figure 7: Block diagram of SVD unit. ACC: accumulator. FIFO: …
Figure 8: Convergence speed (measured as the total number of …
Figure 9: Runtime and convergence of HOOI on some randomly generated 3-way tensors with size …
Figure 10: When the size along each dimension is 256, MATLAB …
Figure 11: Decomposition result of MRI dataset. Top: original …
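
The permutation and unfolding operations described in the Figure 2 caption can be sketched in NumPy. The caption's `permute` uses 1-based mode indices, so `[2, 3, 1]` corresponds to the 0-based axis order `(1, 2, 0)`. The `unfold` helper is hypothetical, and its column ordering (earliest remaining mode varies fastest) is an assumed convention, following Kolda and Bader.

```python
import numpy as np

X = np.arange(5 * 10 * 3, dtype=float).reshape(5, 10, 3)

# permute(X, [2, 3, 1]) in 1-based notation: Y[i2, i3, i1] == X[i1, i2, i3]
Y = np.transpose(X, (1, 2, 0))

def unfold(X, mode):
    """Mode-`mode` unfolding (matricization): fibers of that mode become columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order="F")

X2 = unfold(X, 1)   # shape (10, 15): mode-2 fibers (1-based) as columns
```

Unfolding is itself a permutation followed by a reshape, which is why a dedicated permutation module also serves the TTM and SVD stages.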
Original abstract

Tensor computation has emerged as a powerful mathematical tool for solving high-dimensional and/or extreme-scale problems in science and engineering. The last decade has witnessed tremendous advancement of tensor computation and its applications in machine learning and big data. However, its hardware optimization on resource-constrained devices remains an (almost) unexplored field. This paper presents a hardware accelerator for a classical tensor computation framework, Tucker decomposition. We study three modules of this architecture: tensor-times-matrix (TTM), matrix singular value decomposition (SVD), and tensor permutation, and implement them on a Xilinx FPGA for prototyping. To further reduce the computing time, we propose a warm-start algorithm for the Jacobi iterations in SVD. A fixed-point simulator is used to evaluate the performance of our design. Some synthetic data sets and a real MRI data set are used to validate the design and evaluate its performance. We compare our work with state-of-the-art software toolboxes running on both CPU and GPU, and our work shows a 2.16-30.2x speedup on the cardiac MRI data set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents an FPGA hardware accelerator for Tucker tensor decomposition, implementing tensor-times-matrix (TTM), matrix SVD (with a proposed warm-start Jacobi algorithm), and tensor permutation modules in fixed-point arithmetic on a Xilinx FPGA. It evaluates the design using synthetic datasets and a real cardiac MRI dataset, claiming 2.16–30.2× wall-clock speedups versus state-of-the-art CPU/GPU software toolboxes.

Significance. If the fixed-point implementation is shown to preserve decomposition fidelity on real data, the work would demonstrate a practical route to accelerating tensor methods on resource-constrained hardware, which remains underexplored relative to software toolboxes.

major comments (2)
  1. [Abstract] The central speedup claim (2.16–30.2× on cardiac MRI) is presented without any accompanying accuracy metrics (such as relative Frobenius residual, core-tensor difference, or factor orthogonality) comparing the fixed-point FPGA output to double-precision baselines on the same MRI data. This omission prevents verification that the reported wall-clock advantage corresponds to a numerically equivalent result.
  2. [Abstract, validation section] The fixed-point simulator and hardware pipeline (TTM/SVD/permutation) are described, yet no quantitative comparison of reconstruction quality or factor accuracy versus floating-point references is supplied for the MRI dataset, leaving the weakest assumption (that fixed-point preserves utility) untested in the reported experiments.
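
The three metrics named in the first comment are straightforward to compute once both decompositions are available. The sketch below shows one way to define them in NumPy; the function names are hypothetical, and the core-tensor comparison assumes the two decompositions share the same factor signs and column ordering, which would need to be aligned in practice.

```python
import numpy as np

def reconstruct(core, factors):
    """Rebuild the full tensor from a Tucker core and one factor matrix per mode."""
    X = core
    for mode, U in enumerate(factors):
        X = np.moveaxis(np.tensordot(U, X, axes=(1, mode)), 0, mode)
    return X

def relative_residual(X, core, factors):
    """Relative Frobenius residual ||X - X_hat||_F / ||X||_F."""
    return np.linalg.norm(X - reconstruct(core, factors)) / np.linalg.norm(X)

def core_difference(core_test, core_ref):
    """Relative Frobenius difference between two cores (assumes aligned factors)."""
    return np.linalg.norm(core_test - core_ref) / np.linalg.norm(core_ref)

def orthogonality_error(U):
    """Frobenius norm of U^T U - I; zero for exactly orthonormal columns."""
    return np.linalg.norm(U.T @ U - np.eye(U.shape[1]))
```

Reporting these for the fixed-point output next to the double-precision baseline on the same tensor would address both comments with a single table.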

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for explicit accuracy validation of the fixed-point design on the cardiac MRI dataset. We agree this strengthens the paper and will incorporate the requested metrics in revision.

Point-by-point responses
  1. Referee: [Abstract] The central speedup claim (2.16–30.2× on cardiac MRI) is presented without any accompanying accuracy metrics (such as relative Frobenius residual, core-tensor difference, or factor orthogonality) comparing the fixed-point FPGA output to double-precision baselines on the same MRI data. This omission prevents verification that the reported wall-clock advantage corresponds to a numerically equivalent result.

    Authors: We agree that the abstract and results should include these metrics to substantiate numerical equivalence. The fixed-point simulator was used for all experiments, and we will add relative Frobenius residual, core-tensor difference, and factor orthogonality comparisons between the fixed-point outputs and double-precision baselines specifically for the cardiac MRI data in both the abstract and results sections. revision: yes

  2. Referee: [Abstract, validation section] The fixed-point simulator and hardware pipeline (TTM/SVD/permutation) are described, yet no quantitative comparison of reconstruction quality or factor accuracy versus floating-point references is supplied for the MRI dataset, leaving the weakest assumption (that fixed-point preserves utility) untested in the reported experiments.

    Authors: We concur that the validation section requires explicit quantitative accuracy results for the MRI dataset. While synthetic datasets received some accuracy checks, the MRI experiments emphasized runtime. In revision we will report reconstruction quality (e.g., relative residual) and factor accuracy metrics versus floating-point references for the MRI case, confirming that the fixed-point design preserves utility at the chosen bit widths. revision: yes

Circularity Check

0 steps flagged

No circularity; speedup claims rest on external benchmarks

full rationale

The paper reports wall-clock speedups (2.16–30.2×) obtained by direct timing of the FPGA design against independent CPU/GPU Tucker toolboxes on cardiac MRI and synthetic data. No equations, fitted parameters, or self-citations are used to derive the performance numbers; the result is an empirical measurement against external baselines. The fixed-point accuracy question is a separate correctness concern and does not create circularity in the reported claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an engineering implementation relying on standard assumptions about FPGA resource mapping and fixed-point arithmetic behavior; no new free parameters, axioms, or invented entities are introduced beyond conventional hardware design practice.

