Tucker Tensor Decomposition on FPGA

Kaiqi Zhang; Xiyuan Zhang; Zheng Zhang

arxiv: 1907.01522 · v1 · pith:MF3OBMYCnew · submitted 2019-06-28 · 📡 eess.SP · cs.AR

Pith reviewed 2026-05-25 13:01 UTC · model grok-4.3

classification 📡 eess.SP cs.AR

keywords Tucker decomposition · FPGA accelerator · tensor computation · hardware optimization · MRI data processing · fixed-point arithmetic · Jacobi SVD · speedup evaluation

The pith

An FPGA hardware accelerator for Tucker decomposition delivers a 2.16–30.2× speedup over CPU and GPU software toolboxes on cardiac MRI data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds and evaluates an FPGA implementation of Tucker tensor decomposition to bring high-dimensional tensor methods to resource-limited hardware. It breaks the algorithm into three core modules—tensor-times-matrix multiplication, singular value decomposition via warm-start Jacobi iterations, and tensor permutation—and realizes them in fixed-point arithmetic. Tests on synthetic tensors and a real cardiac MRI dataset show the design runs substantially faster than established software libraries on both CPUs and GPUs. The work therefore claims that dedicated hardware can make classical tensor decompositions practical for embedded or real-time scientific applications.

Core claim

We present an FPGA-based hardware accelerator for Tucker decomposition that implements TTM, SVD with warm-start Jacobi iterations, and permutation operations in fixed-point arithmetic. On a cardiac MRI dataset, the design achieves a 2.16–30.2× speedup over state-of-the-art software toolboxes on CPU and GPU while maintaining useful numerical accuracy.
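
For readers unfamiliar with the notation, the 3-way Tucker model and one HOOI update can be written as follows. This is the textbook formulation, not a formula quoted from the paper; the symbols $\mathcal{G}$, $U^{(n)}$, $I_n$, and $R_n$ are assumed notation.

```latex
% Tucker model: a small core tensor multiplied by one factor matrix per mode
\mathcal{X} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)},
\qquad \mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3},\;
\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times R_3},\;
U^{(n)} \in \mathbb{R}^{I_n \times R_n}

% One HOOI update for mode n (repeated over n = 1, 2, 3 until convergence)
\mathcal{Y} = \mathcal{X} \times_{m \neq n} U^{(m)\top}, \qquad
U^{(n)} \leftarrow \text{leading } R_n \text{ left singular vectors of } Y_{(n)}

% Core tensor once the factors have converged
\mathcal{G} = \mathcal{X} \times_1 U^{(1)\top} \times_2 U^{(2)\top} \times_3 U^{(3)\top}
```

Each $\times_n$ is a tensor-times-matrix product, the singular-vector step is the SVD, and forming $Y_{(n)}$ requires permutation and unfolding, which is why the three modules named in the claim cover the whole loop.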

What carries the argument

FPGA architecture with dedicated modules for tensor-times-matrix multiplication, warm-start Jacobi SVD, and tensor permutation, evaluated through fixed-point simulation.
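
As a point of reference for what the TTM module computes, here is a minimal software sketch of the mode-n tensor-times-matrix product. It is not the paper's hardware design: the flat row-major storage, the 0-based `mode` argument, and the function names `row_major_strides` and `ttm` are assumptions made for illustration.

```python
from itertools import product


def row_major_strides(shape):
    """Strides (in elements) of a dense tensor stored in row-major order."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def ttm(data, shape, U, mode):
    """Mode-`mode` product Y = X x_mode U.

    data  : flat list holding X in row-major order
    shape : dimensions of X
    U     : J x shape[mode] matrix as a list of rows
    mode  : 0-based mode index (an assumption of this sketch)
    Returns (flat data of Y, shape of Y).
    """
    shape = tuple(shape)
    J, I = len(U), len(U[0])
    if I != shape[mode]:
        raise ValueError("U must have shape[mode] columns")
    out_shape = shape[:mode] + (J,) + shape[mode + 1:]
    strides = row_major_strides(shape)
    out = []
    # itertools.product varies the last index fastest, i.e. row-major order.
    for idx in product(*(range(s) for s in out_shape)):
        # Offset of the input fiber along `mode` that feeds this output entry.
        base = sum(i * st for d, (i, st) in enumerate(zip(idx, strides)) if d != mode)
        row = U[idx[mode]]
        out.append(sum(row[k] * data[base + k * strides[mode]] for k in range(I)))
    return out, out_shape


if __name__ == "__main__":
    x = [float(v) for v in range(8)]           # a 2 x 2 x 2 tensor
    y, y_shape = ttm(x, (2, 2, 2), [[1.0, 1.0]], mode=0)
    print(y_shape, y)
```

Every output entry is an independent dot product between one row of `U` and one fiber of the tensor, which is the kind of regular, parallel arithmetic a processing-element array is suited to.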

If this is right

Tensor decompositions become feasible inside power- or size-constrained medical imaging devices.
Warm-start Jacobi iterations reduce iteration count in hardware SVD, shortening overall runtime.
Fixed-point designs lower resource usage on FPGAs compared with floating-point alternatives.
The modular breakdown allows reuse of TTM and permutation blocks for related tensor algorithms.
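
To make the warm-start point concrete, the sketch below implements a one-sided (Hestenes) Jacobi SVD in plain Python and lets the caller pass an initial rotation matrix `V0`. The summary does not spell out the paper's exact warm-start rule, so seeding with the `V` from a previous, similar matrix is an assumption of this sketch; `jacobi_svd`, `V0`, `tol`, and `max_sweeps` are hypothetical names.

```python
import math


def matmul(A, B):
    """Dense matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def jacobi_svd(A, V0=None, tol=1e-10, max_sweeps=60):
    """One-sided Jacobi SVD. Returns (singular values, V, sweeps used).

    V0 is an optional orthogonal starting guess for the right singular
    vectors (the warm start); a cold start uses the identity.
    """
    m, n = len(A), len(A[0])
    if V0 is None:
        V = [[float(i == j) for j in range(n)] for i in range(n)]
    else:
        V = [list(row) for row in V0]
    U = matmul(A, V)  # working matrix; its columns are rotated until orthogonal
    sweep = 0
    for sweep in range(1, max_sweeps + 1):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(U[i][p] ** 2 for i in range(m))
                beta = sum(U[i][q] ** 2 for i in range(m))
                gamma = sum(U[i][p] * U[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q are already orthogonal enough
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to the columns of U and V.
                for M in (U, V):
                    for row in M:
                        row[p], row[q] = c * row[p] - s * row[q], s * row[p] + c * row[q]
        if not rotated:
            break  # a full sweep without rotations means convergence
    sigma = [math.sqrt(sum(U[i][j] ** 2 for i in range(m))) for j in range(n)]
    return sigma, V, sweep


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.5], [1.0, 3.0, 0.2], [0.5, 0.2, 1.0], [0.1, 0.3, 0.7]]
    _, V_prev, _ = jacobi_svd(A)
    # A slightly changed matrix, standing in for the next outer iteration.
    A_next = [[a + 1e-3 * ((i + j) % 3 - 1) for j, a in enumerate(row)]
              for i, row in enumerate(A)]
    _, _, cold = jacobi_svd(A_next)
    _, _, warm = jacobi_svd(A_next, V0=V_prev)
    print("sweeps cold:", cold, "warm:", warm)
```

When consecutive matrices differ only slightly, the warm-started columns begin nearly orthogonal, so one would expect fewer rotation sweeps; how large the saving is in the paper's hardware is reported in its own convergence figure, not derived here.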

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fixed-point FPGA blocks could be retargeted to other tensor factorizations such as CP decomposition.
Power measurements on the FPGA would likely show lower energy per decomposition than GPU baselines.
Scaling the design to larger tensors would depend on memory bandwidth rather than arithmetic throughput.
Warm-start techniques may transfer to other iterative linear-algebra kernels in hardware.

Load-bearing premise

Fixed-point arithmetic preserves enough numerical accuracy that the resulting Tucker factors remain useful on real data such as MRI without large degradation relative to floating-point baselines.
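
The premise is easier to reason about with the quantization step written out. The sketch below shows signed fixed-point conversion and multiplication with rounding and saturation. The 16-bit word and 12 fractional bits are placeholder values chosen for illustration, since this summary does not state the bit widths used in the paper; `to_fixed`, `from_fixed`, and `fixed_mul` are hypothetical helper names.

```python
def saturate(value, word_bits):
    """Clamp an integer to the range of a signed word of `word_bits` bits."""
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, value))


def to_fixed(x, frac_bits=12, word_bits=16):
    """Quantize a real number to a signed fixed-point integer code."""
    return saturate(round(x * (1 << frac_bits)), word_bits)


def from_fixed(q, frac_bits=12):
    """Convert a fixed-point integer code back to a float."""
    return q / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=12, word_bits=16):
    """Multiply two fixed-point codes, round to nearest, then saturate."""
    wide = a * b                                    # carries 2 * frac_bits fractional bits
    rounded = (wide + (1 << (frac_bits - 1))) >> frac_bits
    return saturate(rounded, word_bits)


if __name__ == "__main__":
    a, b = to_fixed(0.5), to_fixed(0.25)
    print(from_fixed(fixed_mul(a, b)))              # product in the same format
    print(from_fixed(to_fixed(0.1)) - 0.1)          # rounding error of one value
    print(to_fixed(100.0))                          # out-of-range input saturates
```

Each in-range value carries a rounding error of at most half a least-significant bit, and anything outside the representable range is clipped. Whether those per-operation errors stay small after many TTM and Jacobi steps is exactly what the premise assumes.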

What would settle it

Run the fixed-point FPGA implementation and a double-precision reference on the same cardiac MRI dataset; if the relative reconstruction error or downstream analysis quality differs by more than a few percent, the performance claim does not hold.
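
The comparison described above reduces to one number per pipeline. A minimal sketch, assuming both reconstructions are available as flattened lists of floats; `relative_error` and `accuracy_gap` are hypothetical helper names, and the acceptable gap ("a few percent") is left to the reader rather than hard-coded.

```python
import math


def relative_error(reference, candidate):
    """Relative Frobenius error ||reference - candidate|| / ||reference||."""
    if len(reference) != len(candidate):
        raise ValueError("tensors must have the same number of entries")
    num = math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, candidate)))
    den = math.sqrt(sum(r * r for r in reference))
    if den == 0.0:
        raise ValueError("reference tensor is all zeros")
    return num / den


def accuracy_gap(original, recon_double, recon_fixed):
    """Extra reconstruction error of the fixed-point run over the double run."""
    return relative_error(original, recon_fixed) - relative_error(original, recon_double)


if __name__ == "__main__":
    original = [3.0, 4.0]
    print(accuracy_gap(original, [3.0, 4.0], [3.0, 4.5]))
```

Reporting the gap alongside the speedup would let a reader judge whether the faster result is also an equivalent one.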

Figures

Figures reproduced from arXiv: 1907.01522 by Kaiqi Zhang, Xiyuan Zhang, Zheng Zhang.

**Figure 2.** Tensor unfolding. Tensor permutation: Tensor permutation changes the mode order of a tensor. It is a high-order extension of the matrix transpose. For instance, given $\mathcal{X} \in \mathbb{R}^{5 \times 10 \times 3}$, permute($\mathcal{X}$, [2, 3, 1]) generates a new tensor $\mathcal{Y} \in \mathbb{R}^{10 \times 3 \times 5}$ with $y_{i_2,i_3,i_1} = x_{i_1,i_2,i_3}$. Unfolding: Unfolding (or matricization) as shown in

**Figure 5.** The TTM unit. The red part is used when computing

**Figure 4.** Overall structure of our Tucker decomposition.

**Figure 6.** (a) The details of a PE. The red part is used only for

**Figure 7.** Block diagram of SVD unit. ACC: accumulator. FIFO:

**Figure 8.** Convergence speed (measured as the total number of

**Figure 9.** Runtime and convergence of HOOI on some randomly generated 3-way tensors with size

**Figure 10.** When the size along each dimension is 256, MATLAB

**Figure 11.** Decomposition result of MRI dataset. Top: original
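
The permutation example in the Figure 2 caption can be checked with a short software sketch. The code assumes flat row-major storage and 0-based mode indices, so the caption's `[2, 3, 1]` becomes `(1, 2, 0)`; the column ordering used by `unfold` is one common convention and may differ from the paper's. Both function names are hypothetical.

```python
from itertools import product


def strides_of(shape):
    """Row-major strides (in elements) for a dense tensor of this shape."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def permute(data, shape, order):
    """Reorder modes: output mode d is input mode order[d] (0-based)."""
    new_shape = tuple(shape[a] for a in order)
    strides = strides_of(shape)
    out = [data[sum(idx[d] * strides[order[d]] for d in range(len(order)))]
           for idx in product(*(range(s) for s in new_shape))]
    return out, new_shape


def unfold(data, shape, mode):
    """Mode-`mode` unfolding: move that mode to the front, then flatten the rest."""
    order = (mode,) + tuple(d for d in range(len(shape)) if d != mode)
    flat, new_shape = permute(data, shape, order)
    cols = len(flat) // new_shape[0]
    return [flat[r * cols:(r + 1) * cols] for r in range(new_shape[0])]


if __name__ == "__main__":
    x = list(range(5 * 10 * 3))                  # X with shape 5 x 10 x 3
    y, y_shape = permute(x, (5, 10, 3), (1, 2, 0))
    print(y_shape)                               # shape of Y after the permutation
    print(unfold(list(range(8)), (2, 2, 2), 1))  # 2 x 4 matricization
```

In software the permutation is pure index arithmetic, but it reads memory with large strides, which is one reason a dedicated permutation module is plausible in hardware.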

Original abstract

Tensor computation has emerged as a powerful mathematical tool for solving high-dimensional and/or extreme-scale problems in science and engineering. The last decade has witnessed tremendous advancement of tensor computation and its applications in machine learning and big data. However, its hardware optimization on resource-constrained devices remains an (almost) unexplored field. This paper presents a hardware accelerator for a classical tensor computation framework, Tucker decomposition. We study three modules of this architecture: tensor-times-matrix (TTM), matrix singular value decomposition (SVD), and tensor permutation, and implement them on a Xilinx FPGA for prototyping. To further reduce the computing time, a warm-start algorithm for the Jacobi iterations in SVD is proposed. A fixed-point simulator is used to evaluate the performance of our design. Some synthetic data sets and a real MRI data set are used to validate the design and evaluate its performance. We compare our work with state-of-the-art software toolboxes running on both CPU and GPU, and our work shows a 2.16–30.2× speedup on the cardiac MRI data set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FPGA Tucker design reports speedups on MRI but leaves fixed-point accuracy unverified against floating-point baselines.

The core of this paper is a practical FPGA implementation of Tucker decomposition, built from TTM, warm-start Jacobi SVD, and permutation blocks, with a fixed-point simulator and direct timing comparisons to CPU/GPU toolboxes on synthetic data and cardiac MRI. The reported 2.16–30.2× speedups are the main empirical result, and the warm-start SVD tweak is a reasonable engineering choice to cut iterations on hardware. That part is straightforward and useful for anyone who needs tensor work on resource-limited devices. The implementation appears to follow standard module patterns rather than introducing new algorithms, which keeps the contribution focused on the hardware mapping. The comparison methodology is external and non-circular, which is a plus.

The soft spot is exactly the one the stress-test flags: the abstract gives no numbers on reconstruction error, factor orthogonality, or core-tensor difference between the fixed-point version and double-precision Tucker on the MRI data. Without those, the speedup claim is hard to interpret as delivering the same result. The paper does not appear to contain machine-checked proofs or open code, so the claims rest on the simulator runs described.

This work is aimed at hardware designers or embedded signal-processing groups who already care about tensor methods on FPGAs; a general tensor-theory reader will find little new. It is coherent on its own terms and shows clear engineering effort, so it deserves a serious referee even though the accuracy gap needs addressing in revision.

Referee Report

2 major / 0 minor

Summary. The paper presents an FPGA hardware accelerator for Tucker tensor decomposition, implementing tensor-times-matrix (TTM), matrix SVD (with a proposed warm-start Jacobi algorithm), and tensor permutation modules in fixed-point arithmetic on Xilinx FPGA. It evaluates the design using synthetic datasets and a real cardiac MRI dataset, claiming 2.16–30.2× wall-clock speedups versus state-of-the-art CPU/GPU software toolboxes.

Significance. If the fixed-point implementation is shown to preserve decomposition fidelity on real data, the work would demonstrate a practical route to accelerating tensor methods on resource-constrained hardware, which remains underexplored relative to software toolboxes.

major comments (2)

[Abstract] Abstract: The central speedup claim (2.16–30.2× on cardiac MRI) is presented without any accompanying accuracy metrics—such as relative Frobenius residual, core-tensor difference, or factor orthogonality—comparing the fixed-point FPGA output to double-precision baselines on the same MRI data. This omission prevents verification that the reported wall-clock advantage corresponds to a numerically equivalent result.
[Abstract] Abstract and validation section: The fixed-point simulator and hardware pipeline (TTM/SVD/permutation) are described, yet no quantitative comparison of reconstruction quality or factor accuracy versus floating-point references is supplied for the MRI dataset, leaving the weakest assumption (that fixed-point preserves utility) untested in the reported experiments.
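
Two of the factor-level checks the referee asks for are cheap to compute. The sketch below measures how far a factor matrix is from having orthonormal columns, and how far apart two factor subspaces are. The subspace measure (distance between projectors) is a choice made for this sketch because it ignores column sign flips and rotations within the subspace; the report itself does not prescribe a formula. Function names are hypothetical.

```python
import math


def orthogonality_defect(U):
    """||U^T U - I||_F for a factor matrix U stored as a list of rows."""
    n = len(U[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            gram = sum(row[i] * row[j] for row in U)
            total += (gram - (1.0 if i == j else 0.0)) ** 2
    return math.sqrt(total)


def projector(U):
    """U U^T, the projector onto the column space when U is orthonormal."""
    m = len(U)
    return [[sum(a * b for a, b in zip(U[i], U[j])) for j in range(m)]
            for i in range(m)]


def subspace_gap(U_ref, U_test):
    """||U_ref U_ref^T - U_test U_test^T||_F, insensitive to column signs."""
    P, Q = projector(U_ref), projector(U_test)
    return math.sqrt(sum((p - q) ** 2
                         for row_p, row_q in zip(P, Q)
                         for p, q in zip(row_p, row_q)))


if __name__ == "__main__":
    U_double = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    U_fixed = [[1.0, 0.0], [0.0, -1.0], [0.0, 0.0]]   # same subspace, flipped sign
    print(orthogonality_defect(U_fixed), subspace_gap(U_double, U_fixed))
```

Comparing core tensors entry by entry is only meaningful after the factors have been aligned in sign and order, which is why a subspace-level measure is used here instead.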

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for explicit accuracy validation of the fixed-point design on the cardiac MRI dataset. We agree this strengthens the paper and will incorporate the requested metrics in revision.

Referee: [Abstract] Abstract: The central speedup claim (2.16–30.2× on cardiac MRI) is presented without any accompanying accuracy metrics—such as relative Frobenius residual, core-tensor difference, or factor orthogonality—comparing the fixed-point FPGA output to double-precision baselines on the same MRI data. This omission prevents verification that the reported wall-clock advantage corresponds to a numerically equivalent result.

Authors: We agree that the abstract and results should include these metrics to substantiate numerical equivalence. The fixed-point simulator was used for all experiments, and we will add relative Frobenius residual, core-tensor difference, and factor orthogonality comparisons between the fixed-point outputs and double-precision baselines specifically for the cardiac MRI data in both the abstract and results sections.

revision: yes

Referee: [Abstract] Abstract and validation section: The fixed-point simulator and hardware pipeline (TTM/SVD/permutation) are described, yet no quantitative comparison of reconstruction quality or factor accuracy versus floating-point references is supplied for the MRI dataset, leaving the weakest assumption (that fixed-point preserves utility) untested in the reported experiments.

Authors: We concur that the validation section requires explicit quantitative accuracy results for the MRI dataset. While synthetic datasets received some accuracy checks, the MRI experiments emphasized runtime. In revision we will report reconstruction quality (e.g., relative residual) and factor accuracy metrics versus floating-point references for the MRI case, confirming that the fixed-point design preserves utility at the chosen bit widths.

revision: yes

Circularity Check

0 steps flagged

No circularity; speedup claims rest on external benchmarks

The paper reports wall-clock speedups (2.16–30.2×) obtained by direct timing of the FPGA design against independent CPU/GPU Tucker toolboxes on cardiac MRI and synthetic data. No equations, fitted parameters, or self-citations are used to derive the performance numbers; the result is an empirical measurement against external baselines. The fixed-point accuracy question is a separate correctness concern and does not create circularity in the reported claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an engineering implementation relying on standard assumptions about FPGA resource mapping and fixed-point arithmetic behavior; no new free parameters, axioms, or invented entities are introduced beyond conventional hardware design practice.

pith-pipeline@v0.9.0 · 5707 in / 1046 out tokens · 53702 ms · 2026-05-25T13:01:58.426243+00:00 · methodology


[1] [1]

Tensor decompositions and applications,

T. Kolda and B. Bader, “Tensor decompositions and applications,” SIAM Rev, vol. 51, no. 3, pp. 455–500, 2009

work page 2009

[2] [2]

The expression of a tensor or a polyadic as a sum of products,

F. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,” J. Math. Phys. , vol. 6, no. 1-4, pp. 164–189, 1927

work page 1927

[3] [3]

Some mathematical notes on three-mode factor analysis,

L. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966

work page 1966

[4] [4]

Tensor-train decomposition,

I. V . Oseledets, “Tensor-train decomposition,” SIAM Journal Sci. Comp. , vol. 33, no. 5, pp. 2295–2317, 2011

work page 2011

[5] [5]

Scalable tensor decompositions for multi-aspect data mining,

T. G. Kolda and J. Sun, “Scalable tensor decompositions for multi-aspect data mining,” in Proc. IEEE Int. Conf. Data Mining , 2008, pp. 363–372

work page 2008

[6] [6]

Bayesian Tensorized Neural Networks with Automatic Rank Selection

C. Hawkins and Z. Zhang, “Bayesian tensorized neural networks with automatic rank selection,” arXiv preprint arXiv:1905.10478 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[7] [7]

A compact CNN-DBLSTM based character model for ofﬂine handwrit- ing recognition with tucker decomposition,

H. Ding, K. Chen, Y . Yuan, M. Cai, L. Sun, S. Liang, and Q. Huo, “A compact CNN-DBLSTM based character model for ofﬂine handwrit- ing recognition with tucker decomposition,” in Proc. IEEE Int. Conf. Document Analysis and Recognition , vol. 1, 2017, pp. 507–512

work page 2017

[8] [8]

Tensor-factorized neural networks,

J.-T. Chien and Y .-T. Bao, “Tensor-factorized neural networks,” IEEE Trans. Neur . Networks Learn. Syst., vol. 29, no. 5, pp. 1998–2011, 2018

work page 1998

[9] [9]

Tensorizing neural networks,

A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” in NIPS, 2015, pp. 442–450

work page 2015

[10] [10]

Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition

V . Lebedev, Y . Ganin, M. Rakhuba, I. Oseledets, and V . Lempit- sky, “Speeding-up convolutional neural networks using ﬁne-tuned CP- decomposition,” arXiv preprint arXiv:1412.6553 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[11] [11]

Tensor-train recurrent neural networks for video classiﬁcation,

Y . Yang, D. Krompass, and V . Tresp, “Tensor-train recurrent neural networks for video classiﬁcation,” inProc. Int. Conf. Machine Learning , 2017, pp. 3891–3900

work page 2017

[12] [12]

Enabling high-dimensional hierarchical uncertainty quantiﬁcation by anova and tensor-train decomposition,

Z. Zhang, X. Yang, I. V . Oseledets, G. E. Karniadakis, and L. Daniel, “Enabling high-dimensional hierarchical uncertainty quantiﬁcation by anova and tensor-train decomposition,” IEEE Trans. CAD of Integrated Circuits and Systems , vol. 34, no. 1, pp. 63–76, 2015

work page 2015

[13] [13]

Big-data tensor recovery for high-dimensional uncertainty quantiﬁcation of process variations,

Z. Zhang, T.-W. Weng, and L. Daniel, “Big-data tensor recovery for high-dimensional uncertainty quantiﬁcation of process variations,” IEEE Trans. Comp., Pack. Manuf. Tech. , vol. 7, no. 5, pp. 687–697, 2017

work page 2017

[14] [14]

Dynamic mri reconstruction using low rank plus sparse tensor decomposition,

S. F. Roohi, D. Zonoobi, A. A. Kassim, and J. L. Jaremko, “Dynamic mri reconstruction using low rank plus sparse tensor decomposition,” in Proc. Int. Conf. Image Process. , 2016, pp. 1769–1773

work page 2016

[15] [15]

Robust tensor subspace learning for anomaly detection,

J. Li, G. Han, J. Wen, and X. Gao, “Robust tensor subspace learning for anomaly detection,” Int. J. Machine Learning and Cybernetics , vol. 2, no. 2, pp. 89–98, 2011

work page 2011

[16] [16]

One-to-many voice conversion based on tensor representation of speaker space,

D. Saito, K. Yamamoto, N. Minematsu, and K. Hirose, “One-to-many voice conversion based on tensor representation of speaker space,” in Proc. Int. Conf. Speech Comm. Assoc. , 2011

work page 2011

[17] [17]

High performance parallel algorithms for the Tucker decomposition of sparse tensors,

O. Kaya and B. Uc ¸ar, “High performance parallel algorithms for the Tucker decomposition of sparse tensors,” inProc. IEEE Int. Conf. Parall. Proc., 2016, pp. 103–112

work page 2016

[18] [18]

Sparse tensor factorization on many-core processors with high-bandwidth memory,

S. Smith, J. Park, and G. Karypis, “Sparse tensor factorization on many-core processors with high-bandwidth memory,” in Proc. IEEE Int. Parallel and Distributed Processing Symp , 2017, pp. 1058–1067

work page 2017

[19] [19]

An input- adaptive and in-place approach to dense tensor-times-matrix multiply,

J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc, “An input- adaptive and in-place approach to dense tensor-times-matrix multiply,” in Proc. Int. Conf. High Performance Computing, Networking, Storage and Analysis , 2015, pp. 1–12

work page 2015

[20] [20]

Accelerating matrix product on reconﬁgurable hardware for signal processing,

A. Amira, A. Bouridane, and P. Milligan, “Accelerating matrix product on reconﬁgurable hardware for signal processing,” in Proc. Int. Conf. Field Programmable Logic and Applications , 2001, pp. 101–111

work page 2001

[21] [21]

64-bit ﬂoating-point FPGA matrix multiplication,

Y . Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, “64-bit ﬂoating-point FPGA matrix multiplication,” in Proc. Int. Symp. Field- programmable Gate Arrays , 2005, pp. 86–95

work page 2005

[22] [22]

FPGA implementations of neural networks–a survey of a decade of progress,

J. Zhu and P. Sutton, “FPGA implementations of neural networks–a survey of a decade of progress,” in Proc. FPLA, 2003, pp. 1062–1066

work page 2003

[23] [23]

DLAU: A scalable deep learning accelerator unit on FPGA,

C. Wang, L. Gong, Q. Yu, X. Li, Y . Xie, and X. Zhou, “DLAU: A scalable deep learning accelerator unit on FPGA,” IEEE Trans. CAD of Integr . Circuits and Systems, vol. 36, no. 3, pp. 513–517, 2017

work page 2017

[24] [24]

A hardware efﬁcient support vector machine architecture for FPGA,

K. Irick, M. DeBole, V . Narayanan, and A. Gayasen, “A hardware efﬁcient support vector machine architecture for FPGA,” in Proc. Int. Symp. FPCCM , 2008, pp. 304–305

work page 2008

[25] [25]

Low-complexity FPGA implementa- tion of compressive sensing reconstruction,

J. L. Stanislaus and T. Mohsenin, “Low-complexity FPGA implementa- tion of compressive sensing reconstruction,” in Int. Conf. Comput., Netw. Comm., 2013, pp. 671–675

work page 2013

[26] [26]

Multilinear image analysis for facial recognition,

M. A. O. Vasilescu and D. Terzopoulos, “Multilinear image analysis for facial recognition,” in Object recognition supported by user interaction for service robots , vol. 2. IEEE, 2002, pp. 511–514

work page 2002

[27] [27]

Tensor decomposition for signal processing and machine learning,

N. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” IEEE Trans. Sign. Proc., vol. 65, no. 13, pp. 3551–3582, 2017

work page 2017

[28] [28]

Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

Y .-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” arXiv preprint arXiv:1511.06530 , 2015

[29] L. De Lathauwer, B. De Moor, and J. Vandewalle, “On the best rank-1 and rank-(r1, r2,..., rn) approximation of higher-order tensors,” SIAM J. Matrix Analysis and Applications, vol. 21, no. 4, pp. 1324–1342, 2000

[30] L. R. Tucker, “Implications of factor analysis of three-way matrices for measurement of change,” Problems in Measuring Change, vol. 15, pp. 122–137, 1963

[31] P. M. Kroonenberg and J. De Leeuw, “Principal component analysis of three-mode data by means of alternating least squares algorithms,” Psychometrika, vol. 45, no. 1, pp. 69–97, 1980

[32] A. Kapteyn, H. Neudecker, and T. Wansbeek, “An approach to n-mode components analysis,” Psychometrika, vol. 51, no. 2, pp. 269–275, 1986

[33] E. R. Hansen, “On cyclic Jacobi methods,” J. Soc. Indust. and Appl. Math., vol. 11, no. 2, pp. 448–459, 1963

[34] J. Demmel and K. Veselić, “Jacobi’s method is more accurate than QR,” SIAM J. Matrix Anal. Appl., vol. 13, no. 4, pp. 1204–1245, 1992

[35] N. Hemkumar and J. Cavallaro, “A systolic VLSI architecture for complex SVD,” in Proc. IEEE ISCAS, vol. 3, 1992, pp. 1061–1064

[36] A. Ahmedsaid, A. Amira, and A. Bouridane, “Improved SVD systolic array and implementation on FPGA,” in Proc. FPL, 2003, pp. 35–42

[37] M. Rahmati, M. S. Sadri, and M. A. Naeini, “FPGA based singular value decomposition for image processing applications,” in IEEE Intl. Conf. ASSAP, 2008, pp. 185–190

[38] R. Brent and F. Luk, “The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays,” SIAM J. Sci. Stat. Comp., vol. 6, no. 1, pp. 69–84, 1985

[39] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE Trans. Electronic Computers, no. 3, pp. 330–334, 1959

[40] B. Bader, T. Kolda et al., “Matlab tensor toolbox version 2.6,” Available online, February 2015. [Online]. Available: http://www.sandia.gov/~tgkolda/TensorToolbox/

[41] B. W. Bader and T. G. Kolda, “Algorithm 862: MATLAB tensor classes for fast algorithm prototyping,” ACM Trans. Math. Software, vol. 32, no. 4, pp. 635–653, Dec 2006

[42] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic, “TensorLy: Tensor learning in Python,” CoRR, vol. abs/1610.09555, 2018

[43] “First-pass myocardial perfusion real-time MRI dataset,” https://statweb.stanford.edu/~candes/SURE/matlab/JDT/DATA/invivo perfusion4.mat, accessed: 2019-03-19

[44] S. Lingala, Y. Hu, E. DiBella, and M. Jacob, “Accelerated dynamic MRI exploiting sparsity and low-rank structure: k-t SLR,” IEEE Trans. Medical Imaging, vol. 30, no. 5, pp. 1042–1054, 2011
