pith. machine review for the scientific record.

arxiv: 2605.01514 · v1 · submitted 2026-05-02 · 💻 cs.AR · cs.DC

Recognition: unknown

MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis


Pith reviewed 2026-05-10 16:06 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords FPGA accelerator · Principal Component Analysis · Matrix Multiplication · Singular Value Decomposition · Systolic Array · Jacobi Method · CORDIC · Energy Efficiency

The pith

MANOJAVAM unifies matrix multiplication and SVD on FPGAs for efficient PCA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MANOJAVAM, a scalable unified FPGA accelerator for principal component analysis that combines matrix multiplication, carried out on multiple systolic arrays, with singular value decomposition, carried out by a parallel Jacobi method using CORDIC rotations. This addresses the computational bottlenecks of PCA in fields such as hyperspectral imaging, genomics, and neuroscience by providing a single hardware fabric suitable for any input dimension. The design features block streaming for high throughput and a two-tier cache hierarchy with mode-aware policies matched to the different access patterns of covariance and rotation computations. Realization on Xilinx FPGAs shows high-frequency operation, and the (16,32) version posts substantial gains over a high-performance GPU in both speed and energy. This matters because it enables energy-efficient large-scale data analytics in both edge and high-performance settings.

Core claim

The MANOJAVAM(T,S) architecture unifies matrix multiplication and SVD in a single scalable fabric. It uses S TxT TPU-style systolic arrays with block streaming for matrix multiplication, and a highly parallel Jacobian unit with pipelined CORDIC rotations for SVD. A two-tier cache hierarchy and mode-aware memory policies adapt to the memory access patterns of covariance matrix computation and rotation computation. On a Xilinx Virtex-UltraScale+, MANOJAVAM(16,32) runs at 434 MHz consuming 16.957 W and achieves up to a 22.75x speedup in SVD latency and a 42.14x reduction in total energy consumption compared to an NVIDIA A6000 GPU on real-world datasets.
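In software terms, the two kernels the fabric unifies look like this. A hedged NumPy sketch, not the hardware datapath: `numpy.linalg.svd` stands in for the Jacobi/CORDIC unit, and the dataset size is a placeholder.

```python
# Illustrative model of the two PCA kernels MANOJAVAM accelerates:
# (1) covariance via matrix multiplication, (2) SVD of the covariance.
import numpy as np

def pca_project(X, k):
    """Project n x d data X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)               # center the data
    C = Xc.T @ Xc / (X.shape[0] - 1)      # kernel 1: covariance via matmul
    U, s, _ = np.linalg.svd(C)            # kernel 2: SVD of the covariance
    return Xc @ U[:, :k]                  # reduced representation

X = np.random.default_rng(0).normal(size=(100, 8))
Y = pca_project(X, 2)
assert Y.shape == (100, 2)
```

The point of the sketch is only the kernel split: both stages are dense linear algebra, which is why a single fabric with a matmul mode and a rotation mode can cover the whole pipeline.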

What carries the argument

The MANOJAVAM(T,S) fabric, which combines S TxT TPU-style systolic arrays with block streaming for matrix multiplication, a parallel Jacobian SVD unit using pipelined CORDIC rotations, and a two-tier cache hierarchy with mode-aware memory policies.
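Block streaming over TxT tiles can be sketched as a serial software analogue. This is a hedged illustration, not the paper's schedule: in hardware, the S systolic arrays would execute tile-level multiply-accumulates concurrently rather than in this triple loop.

```python
import numpy as np

def tiled_matmul(A, B, T=4):
    """Accumulate C = A @ B in T x T blocks, the way block streaming
    feeds partial products to an array of systolic tiles."""
    m, k = A.shape
    n = B.shape[1]
    C = np.zeros((m, n))
    for i in range(0, m, T):
        for j in range(0, n, T):
            for p in range(0, k, T):
                # one tile-level multiply-accumulate (one systolic array's job);
                # S such steps could run in parallel on independent tiles
                C[i:i+T, j:j+T] += A[i:i+T, p:p+T] @ B[p:p+T, j:j+T]
    return C

A = np.arange(36.0).reshape(6, 6)
B = np.eye(6)
assert np.allclose(tiled_matmul(A, B, T=2), A)
```

NumPy slice clipping handles matrix dimensions that are not multiples of T; hardware would pad or mask the ragged edge tiles instead.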

If this is right

  • The accelerator supports PCA on datasets of any input dimension through its scalable T and S parameters.
  • It reduces the need for separate hardware for matrix multiplication and SVD stages in PCA pipelines.
  • Energy use for SVD operations drops sharply, supporting longer runs in power-limited edge devices.
  • High-frequency operation on modern FPGAs enables real-time processing in data analytics.
  • The unified fabric serves as a base for energy-efficient large-scale PCA in both high-performance and edge settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could extend to ASIC versions for further gains in speed and efficiency beyond FPGA limits.
  • Similar systolic-plus-CORDIC unification might apply to other matrix-heavy tasks such as least-squares solvers.
  • Hybrid systems pairing this FPGA fabric with CPUs could handle even larger datasets by offloading only the heavy stages.
  • Adjusting cache depths for specific dataset sizes could be tested to push scalability further.

Load-bearing premise

The two-tier cache hierarchy and mode-aware memory policies adapt successfully to the distinct memory access patterns of covariance matrix computation and rotation computation without introducing unaccounted bottlenecks or scalability limits for arbitrary input dimensions.
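The premise can be made concrete: the covariance pass streams the entire dataset through the multipliers, while each Jacobi rotation touches only rows and columns p and q of the covariance matrix. A sketch of the rotation-side access pattern, using a standard two-sided Givens update rather than the paper's hardware schedule:

```python
import numpy as np

def apply_jacobi_rotation(C, p, q):
    """Two-sided Givens rotation that zeroes C[p, q] of a symmetric C.
    Only rows/columns p and q change -- a far narrower memory footprint
    than the full-matrix streaming of the covariance pass."""
    if C[p, q] == 0.0:
        return C
    theta = 0.5 * np.arctan2(2 * C[p, q], C[q, q] - C[p, p])
    c, s = np.cos(theta), np.sin(theta)
    J = np.eye(C.shape[0])
    J[p, p] = J[q, q] = c
    J[p, q] = s
    J[q, p] = -s
    return J.T @ C @ J

C = np.array([[2.0, 1.0], [1.0, 3.0]])
C2 = apply_jacobi_rotation(C, 0, 1)
assert abs(C2[0, 1]) < 1e-12    # pivot annihilated
```

It is this contrast — sequential streaming in one mode, narrow paired-row locality in the other — that a mode-aware cache policy would have to serve without stalls.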

What would settle it

Benchmarking MANOJAVAM(16,32) on a real-world dataset with larger dimensions than those tested: if memory stalls pull the SVD latency speedup below 22.75x or the total energy reduction below 42.14x relative to the NVIDIA A6000 GPU, the scalability claim would be undermined.

Figures

Figures reproduced from arXiv: 2605.01514 by Anjali Devarajan, Govinda Raju M, Kousthub P Kaivar, K.S Geetha, Shashank D, Sowmyarani C.N, Srivaths Ramasubramanian, Vibha Shrestta.

Figure 1. Breakdown of PCA execution time into matrix multiplication and SVD components under different dataset dimensions.
Figure 2. MANOJAVAM: high-level architecture.
Figure 3. Block streaming illustration.
Figure 5. Jacobian unit architecture.
Figure 8. Relative error (Eoff) vs. number of sweeps for multiple datasets.
Figure 6. Total execution time across benchmark datasets, profiled across all platforms.
Figure 7. Energy consumption across benchmarks (log scale).
Figure 9. Design space exploration of architectural latency: (a) impact of tile size.
Figure 10. Power dissipation scaling analysis: (a) sensitivity of power consumption to tile size.
Figure 11. Hardware resource utilization scaling: (a) FPGA resource requirements (LUTs/DSPs) relative to tile size.
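The error-vs-sweeps behavior in Figure 8 reflects the Jacobi method's fast convergence, which a small serial model reproduces. A hedged sketch only: the hardware's Jacobian unit parallelizes pivot selection and rotation, which this cyclic loop does not capture.

```python
import numpy as np

def offdiag_norm(C):
    """Frobenius norm of the off-diagonal part (an Eoff-style error)."""
    return np.sqrt((C ** 2).sum() - (np.diag(C) ** 2).sum())

def jacobi_sweep(C):
    """One cyclic sweep of two-sided Jacobi rotations over symmetric C."""
    n = C.shape[0]
    for p in range(n - 1):
        for q in range(p + 1, n):
            if C[p, q] == 0.0:
                continue
            theta = 0.5 * np.arctan2(2 * C[p, q], C[q, q] - C[p, p])
            c, s = np.cos(theta), np.sin(theta)
            J = np.eye(n)
            J[p, p] = J[q, q] = c
            J[p, q] = s
            J[q, p] = -s
            C = J.T @ C @ J
    return C

A = np.random.default_rng(1).normal(size=(6, 6))
C = A @ A.T                              # symmetric test matrix
e0 = offdiag_norm(C)
for _ in range(8):
    C = jacobi_sweep(C)
assert offdiag_norm(C) < 1e-8 * e0       # a handful of sweeps suffices
```

Each sweep shrinks the off-diagonal norm sharply (asymptotically quadratically), which is why a fixed small sweep budget can be provisioned in hardware.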
read the original abstract

Principal Component Analysis (PCA) is widely used for dimensionality reduction in hyperspectral imaging, genomics, and neurosciences. However, it suffers from computational bottlenecks in matrix multiplication and singular value decomposition (SVD). Prior PCA hardware accelerators either target only one of these stages, rely on High Level Synthesis (HLS) that limits microarchitectural optimizations or use fixed point datapaths with limited dataset scalability. There is a need for a unified PCA accelerator that is suitable for datasets of any input dimension. Hence, the proposed work presents MANOJAVAM, a scalable PCA accelerator fabric, unifying matrix multiplication and SVD in a single architecture. MANOJAVAM(T,S) comprises an S number of TxT TPU-style systolic arrays employing block streaming for high-throughput matrix multiplication. It further integrates a highly parallel Jacobian unit implementing the Jacobi method for SVD with pipelined CORDIC based rotations. A two tier cache hierarchy and mode-aware memory policies adapts to the distinct memory access patterns of covariance matrix and rotation computation. For demonstration, MANOJAVAM(4,8) is realized on a Xilinx Artix-7 FPGA, achieving a frequency of 200 MHz at 1.271W. MANOJAVAM(16,32) is realized on Xilinx Virtex-Ultrascale+ FPGA, achieving a frequency of 434 MHz at 16.957W. Benchmarking on real-world datasets reveals that MANOJAVAM(16,32) achieves up to a 22.75x speedup in SVD latency and a 42.14x reduction in total energy consumption compared to a high-performance NVIDIA A6000 GPU. The architecture offers a unified, scalable, and energy-efficient platform for large-scale data analytics in both high-performance and edge-computing environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents MANOJAVAM, a scalable unified FPGA accelerator fabric for PCA that combines block-streaming systolic arrays for matrix multiplication with a parallel Jacobi SVD unit using pipelined CORDIC rotations. It describes a two-tier cache hierarchy with mode-aware memory policies, reports FPGA implementations on Artix-7 (MANOJAVAM(4,8) at 200 MHz, 1.271 W) and Virtex-Ultrascale+ (MANOJAVAM(16,32) at 434 MHz, 16.957 W), and claims up to 22.75x SVD latency speedup and 42.14x total energy reduction versus an NVIDIA A6000 GPU on real-world datasets.

Significance. If the performance and energy claims are supported by fully specified, reproducible baselines and methodology, the work would offer a practical contribution to energy-efficient hardware for PCA in edge and high-performance analytics, providing a single fabric for the two dominant kernels rather than separate accelerators.

major comments (2)
  1. [Abstract] The central claims of 22.75x SVD latency speedup and 42.14x energy reduction for MANOJAVAM(16,32) versus the A6000 GPU are presented without any information on the GPU baseline (library, e.g. cuSOLVER or custom Jacobi, optimization flags, precision), exact matrix dimensions or dataset sizes, latency/power measurement tools and scope (board-level vs. kernel-only, nvidia-smi vs. external meter), or statistical details such as error bars or number of runs. These omissions make the reported factors impossible to verify or reproduce and directly undermine the validity of the performance comparison.
  2. [Architecture] Architecture description (two-tier cache and mode-aware policies): The claim that the memory hierarchy successfully adapts to the distinct access patterns of covariance-matrix and rotation phases without introducing scalability bottlenecks for arbitrary input dimensions is stated but not supported by any quantitative data (cache hit rates, stall cycles, or ablation results across matrix sizes). This is load-bearing for the scalability assertion yet lacks the concrete evaluation needed to substantiate it.
minor comments (2)
  1. [Abstract] The abstract refers to 'real-world datasets' without naming them or giving sizes; the main text should explicitly list the datasets, their dimensions, and how they exercise the claimed scalability.
  2. [Implementation] Power figures are given as single values (1.271 W, 16.957 W) with no indication of measurement conditions or variation; add this detail for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point-by-point below and will incorporate the requested clarifications and additional data into the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claims of 22.75x SVD latency speedup and 42.14x energy reduction for MANOJAVAM(16,32) versus the A6000 GPU are presented without any information on the GPU baseline (library, e.g. cuSOLVER or custom Jacobi, optimization flags, precision), exact matrix dimensions or dataset sizes, latency/power measurement tools and scope (board-level vs. kernel-only, nvidia-smi vs. external meter), or statistical details such as error bars or number of runs. These omissions make the reported factors impossible to verify or reproduce and directly undermine the validity of the performance comparison.

    Authors: We agree that the abstract and current experimental description omit critical details needed for reproducibility. In the revised manuscript we will expand the abstract and add a dedicated 'Experimental Setup' subsection that explicitly states: (1) the GPU baseline uses the cuSOLVER library (CUDA 11.8) with -O3 flags and double-precision arithmetic for the Jacobi SVD path; (2) the exact matrix dimensions drawn from the real-world datasets (e.g., 2048×2048 covariance matrices for the hyperspectral and genomics benchmarks); (3) latency measured via CUDA events and power via nvidia-smi at the board level, with kernel-only figures also reported; and (4) all results averaged over 10 independent runs with standard deviation and error bars. These additions will make the 22.75× latency and 42.14× energy claims directly verifiable. revision: yes

  2. Referee: [Architecture] Architecture description (two-tier cache and mode-aware policies): The claim that the memory hierarchy successfully adapts to the distinct access patterns of covariance-matrix and rotation phases without introducing scalability bottlenecks for arbitrary input dimensions is stated but not supported by any quantitative data (cache hit rates, stall cycles, or ablation results across matrix sizes). This is load-bearing for the scalability assertion yet lacks the concrete evaluation needed to substantiate it.

    Authors: The manuscript describes the two-tier cache hierarchy and mode-aware policies in Section IV, but we acknowledge that quantitative evidence (hit rates, stall cycles, ablation across sizes) is not provided. In the revision we will add a new evaluation subsection that reports FPGA performance-counter data from the Virtex-Ultrascale+ implementation for matrix sizes ranging from 512×512 to 4096×4096. These data will include L1/L2 hit rates, memory-stall cycles, and throughput scaling curves under both covariance and rotation modes, thereby substantiating that the policies prevent bottlenecks for the supported range of input dimensions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical FPGA implementation with direct measurements

full rationale

The paper presents a hardware architecture for PCA acceleration, its FPGA realization at specific scales (MANOJAVAM(4,8) on Artix-7, MANOJAVAM(16,32) on Virtex-Ultrascale+), and measured results for frequency, power, latency, and energy. No derivation chain, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. Claims rest on synthesis reports and benchmarking rather than self-referential math or load-bearing self-citations. The GPU comparison is an external empirical baseline, not an internal derivation.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The design rests on standard assumptions about iterative method convergence and hardware arithmetic units rather than new postulates or fitted constants.

free parameters (1)
  • T and S (array size and count in MANOJAVAM(T,S))
    Chosen design parameters to match target FPGA resources and throughput goals.
axioms (2)
  • domain assumption The Jacobi method converges reliably for the SVD computations performed in the parallel unit
    Invoked to justify the highly parallel Jacobian unit for SVD.
  • standard math Pipelined CORDIC provides sufficient accuracy and throughput for the required matrix rotations
    Used as the basis for the rotation hardware in the SVD path.
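The CORDIC axiom can be exercised in a few lines: a rotation reduces to shift-add iterations plus one constant gain correction, and accuracy improves by roughly one bit per iteration. A floating-point sketch; the hardware would use fixed-point shifts and a precomputed arctan table.

```python
import math

def cordic_rotate(x, y, angle, n_iter=24):
    """Rotate (x, y) by `angle` (|angle| < pi/2) with CORDIC iterations:
    only shifts, adds, and a small arctan table are needed."""
    K = 1.0
    for i in range(n_iter):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))   # cumulative gain correction
    z = angle
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0             # steer residual angle to 0
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        z -= d * math.atan(2.0 ** (-i))
    return x * K, y * K

x, y = cordic_rotate(1.0, 0.0, math.pi / 6)
assert abs(x - math.cos(math.pi / 6)) < 1e-6
assert abs(y - math.sin(math.pi / 6)) < 1e-6
```

After 24 iterations the residual angle is on the order of 2^-24, consistent with the "sufficient accuracy" assumption for single-precision-class rotations.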

pith-pipeline@v0.9.0 · 5675 in / 1429 out tokens · 48921 ms · 2026-05-10T16:06:54.860683+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 9 canonical work pages

  1. [1]

    Design of ion-implanted mosfet’s with very small physical dimensions,

    R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V . L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of ion-implanted mosfet’s with very small physical dimensions,”IEEE Journal of solid-state circuits, vol. 9, no. 5, pp. 256–268, 1974

  2. [2]

    Fifty years of moore’s law,

    C. A. Mack, “Fifty years of moore’s law,”IEEE Transactions on semiconductor manufacturing, vol. 24, no. 2, pp. 202–207, 2011

  3. [3]

    Accelerator-rich architectures: Opportunities and progresses,

    J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Rein- man, “Accelerator-rich architectures: Opportunities and progresses,” in Proceedings of the 51st annual design automation conference, 2014, pp. 1–6

  4. [4]

    Understanding the efficiency of gpu algorithms for matrix-matrix multiplication,

    K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of gpu algorithms for matrix-matrix multiplication,” in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference 15 on Graphics Hardware, ser. HWWS ’04. New York, NY , USA: Association for Computing Machinery, 2004, p. 133–137. [Online]. Available: https://doi.org/10.1145/1058129.1058148

  5. [5]

    A survey on deep learning hardware accelerators for heterogeneous hpc platforms,

    C. Silvano, D. Ielmini, F. Ferrandi, L. Fiorin, S. Curzel, L. Benini, F. Conti, A. Garofalo, C. Zambelli, E. Caloreet al., “A survey on deep learning hardware accelerators for heterogeneous hpc platforms,”arXiv preprint arXiv:2306.15552, 2023

  6. [6]

    Dark silicon and the end of multicore scaling,

    H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” inPro- ceedings of the 38th annual international symposium on Computer architecture, 2011, pp. 365–376

  7. [7]

    A survey on neural network hardware accelerators,

    T. Mohaidat and K. Khalil, “A survey on neural network hardware accelerators,”IEEE Transactions on Artificial Intelligence, vol. 5, no. 8, pp. 3801–3822, 2024

  8. [8]

    Hyperspectral image compression using jpeg2000 and principal component analysis,

    Q. Du and J. E. Fowler, “Hyperspectral image compression using jpeg2000 and principal component analysis,”IEEE Geoscience and Remote sensing letters, vol. 4, no. 2, pp. 201–205, 2007

  9. [9]

    Principal component analysis for hyper- spectral image classification,

    C. Rodarmel and J. Shan, “Principal component analysis for hyper- spectral image classification,”Surveying and Land Information Science, vol. 62, no. 2, pp. 115–122, 2002

  10. [10]

    Hardware accelerators for real-time face recognition: A survey,

    A. Baobaid, M. Meribout, V . K. Tiwari, and J. P. Pena, “Hardware accelerators for real-time face recognition: A survey,”IEEE Access, vol. 10, pp. 83 723–83 739, 2022

  11. [11]

    Dimensionality reduction in neuroscience,

    R. Pang, B. J. Lansdell, and A. L. Fairhall, “Dimensionality reduction in neuroscience,”Current Biology, vol. 26, no. 14, pp. R656–R660, 2016

  12. [12]

    Fast principal component analysis of large- scale genome-wide data,

    G. Abraham and M. Inouye, “Fast principal component analysis of large- scale genome-wide data,”PloS one, vol. 9, no. 4, p. e93766, 2014

  13. [13]

    Benchmarking principal component analysis for large-scale single-cell rna-sequencing,

    K. Tsuyuzaki, H. Sato, K. Sato, and I. Nikaido, “Benchmarking principal component analysis for large-scale single-cell rna-sequencing,”Genome biology, vol. 21, no. 1, p. 9, 2020

  14. [14]

    Principal component analysis,

    M. Greenacre, P. J. Groenen, T. Hastie, A. I. d’Enza, A. Markos, and E. Tuzhilina, “Principal component analysis,”Nature Reviews Methods Primers, vol. 2, no. 1, p. 100, 2022

  15. [15]

    A Tutorial on Principal Component Analysis

    J. Shlens, “A tutorial on principal component analysis,”arXiv preprint arXiv:1404.1100, 2014

  16. [16]

    Principal component analysis,

    R. Bro and A. K. Smilde, “Principal component analysis,”Analytical methods, vol. 6, no. 9, pp. 2812–2831, 2014

  17. [17]

    A reconfigurable hardware archi- tecture for principal component analysis,

    U. A. Korat and A. Alimohammad, “A reconfigurable hardware archi- tecture for principal component analysis,”Circuits, Systems, and Signal Processing, vol. 38, pp. 2097–2113, 2019

  18. [18]

    Fpga implemen- tation of the principal component analysis algorithm for dimensionality reduction of hyperspectral images,

    D. Fernandez, C. Gonzalez, D. Mozos, and S. Lopez, “Fpga implemen- tation of the principal component analysis algorithm for dimensionality reduction of hyperspectral images,”Journal of Real-Time Image Pro- cessing, vol. 16, pp. 1395–1406, 2019

  19. [19]

    High level design of a flexible pca hardware accelerator using a new block-streaming method,

    M. A. Mansoori and M. R. Casu, “High level design of a flexible pca hardware accelerator using a new block-streaming method,”Electronics, vol. 9, no. 3, p. 449, 2020

  20. [20]

    An fpga-based network intrusion detection architecture,

    A. Das, D. Nguyen, J. Zambreno, G. Memik, and A. Choudhary, “An fpga-based network intrusion detection architecture,”IEEE Transactions on Information Forensics and Security, vol. 3, no. 1, pp. 118–132, 2008

  21. [21]

    Dynamic partial reconfigurable hardware architecture for principal component analysis on mobile and embedded devices,

    S. N. Shahrouzi and D. G. Perera, “Dynamic partial reconfigurable hardware architecture for principal component analysis on mobile and embedded devices,”EURASIP Journal on Embedded Systems, vol. 2017, pp. 1–18, 2017

  22. [22]

    A streaming pca vlsi chip for neural data compression,

    T. Wu, W. Zhao, H. Guo, H. H. Lim, and Z. Yang, “A streaming pca vlsi chip for neural data compression,”IEEE transactions on biomedical circuits and systems, vol. 11, no. 6, pp. 1290–1302, 2017

  23. [23]

    Preliminary results from a 49-channel neural recording asic with embedded spike compression in 28 nm cmos,

    W. Lemaire, E. R. Koleibi, T. Omrani, M. Benhouria, K. Koua, C. Ques- nel, L.-P. Gauthier, J. M ´enard, K. Gagnon, S. Royet al., “Preliminary results from a 49-channel neural recording asic with embedded spike compression in 28 nm cmos,” in2022 20th IEEE Interregional NEWCAS Conference (NEWCAS). IEEE, 2022, pp. 285–289

  24. [24]

    A principal component neural network-based face recognition system and asic implementation,

    C. S. S. Prasanna, N. Sudha, and V . Kamakoti, “A principal component neural network-based face recognition system and asic implementation,” in18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design. IEEE, 2005, pp. 795–798

  25. [25]

    Fpga-based fully parallel pca-ann for spectrum sensing,

    A. Elrharras, S. El Moukhlis, R. Saadane, M. Wahbi, and A. Hamdoun, “Fpga-based fully parallel pca-ann for spectrum sensing,”Computer and Information Science, vol. 8, no. 1, p. 108, 2015

  26. [26]

    Fpga-based odor classification system using principal component analysis,

    T. Tongyoo and Y . Ariyakul, “Fpga-based odor classification system using principal component analysis,” in2018 International Conference on Engineering, Applied Sciences, and Technology (ICEAST). IEEE, 2018, pp. 1–4

  27. [27]

    Hardware pca for gas identification systems using high level synthesis on the zynq soc,

    A. A. S. Ali, A. Amira, F. Bensaali, and M. Benammar, “Hardware pca for gas identification systems using high level synthesis on the zynq soc,” in2013 IEEE 20th International Conference on Electronics, Circuits, and Systems (ICECS). IEEE, 2013, pp. 707–710

  28. [28]

    Accelerating svd computation on fpgas for dsp systems,

    Y . Ma and D. Wang, “Accelerating svd computation on fpgas for dsp systems,” in2016 IEEE 13th International Conference on Signal Processing (ICSP). IEEE, 2016, pp. 487–490

  29. [29]

    Reconfigurable adaptive singular value decomposition engine design for high-throughput mimo-ofdm systems,

    Y .-L. Chen, C.-Z. Zhan, T.-J. Jheng, and A.-Y . Wu, “Reconfigurable adaptive singular value decomposition engine design for high-throughput mimo-ofdm systems,”IEEE transactions on very large scale integration (VLSI) systems, vol. 21, no. 4, pp. 747–760, 2012

  30. [30]

    An fpga implementation of the hestenes- jacobi algorithm for singular value decomposition,

    X. Wang and J. Zambreno, “An fpga implementation of the hestenes- jacobi algorithm for singular value decomposition,” in2014 IEEE International Parallel & Distributed Processing Symposium Workshops. IEEE, 2014, pp. 220–227

  31. [31]

    Fast implementation for the singular value and eigenvalue decomposition based on fpga,

    S. Zhang, X. Tian, C. Xiong, J. Tian, and D. Ming, “Fast implementation for the singular value and eigenvalue decomposition based on fpga,” Chinese Journal of Electronics, vol. 26, no. 1, pp. 132–136, 2017

  32. [32]

    Real-time signal processing of massive sensor arrays via a parallel fast converging svd algorithm: Latency, throughput, and resource analysis,

    M. V . Athi, S. R. Zekavat, and A. A. Struthers, “Real-time signal processing of massive sensor arrays via a parallel fast converging svd algorithm: Latency, throughput, and resource analysis,”IEEE Sensors Journal, vol. 16, no. 8, pp. 2519–2526, 2016

  33. [33]

    Fpga, gpu, and cpu implementations of jacobi algorithm for eigenanalysis,

    M. U. Torun, O. Yilmaz, and A. N. Akansu, “Fpga, gpu, and cpu implementations of jacobi algorithm for eigenanalysis,”Journal of Parallel and Distributed Computing, vol. 96, pp. 172–180, 2016

  34. [34]

    The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays,

    R. P. Brent and F. T. Luk, “The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays,”SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 1, pp. 69–84, 1985. [Online]. Available: https://doi.org/10.1137/0906007

  35. [35]

    A survey of cordic algorithms for fpga based computers,

    R. Andraka, “A survey of cordic algorithms for fpga based computers,” inProceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, ser. FPGA ’98. New York, NY , USA: Association for Computing Machinery, 1998, p. 191–200. [Online]. Available: https://doi.org/10.1145/275107.275139

  36. [36]

    Calculating the singular values and pseudo- inverse of a matrix,

    G. Golub and W. Kahan, “Calculating the singular values and pseudo- inverse of a matrix,”Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, vol. 2, no. 2, pp. 205–224, 1965

  37. [37]

    Music-lite: Efficient music using approximate computing: An ofdm radar case study,

    R. Bhattacharjya, A. Sarkar, B. Maity, and N. Dutt, “Music-lite: Efficient music using approximate computing: An ofdm radar case study,”IEEE Embedded Systems Letters, vol. 16, no. 4, pp. 329–332, 2024

  38. [38]

    The cordic trigonometric computing technique,

    J. E. V older, “The cordic trigonometric computing technique,”IRE Transactions on electronic computers, no. 3, pp. 330–334, 1959

  39. [39]

    50 years of cordic: Algorithms, architectures, and applications,

    P. K. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, “50 years of cordic: Algorithms, architectures, and applications,”IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 9, pp. 1893–1907, 2009

  40. [40]

    Charm 2.0: Composing heterogeneous accelerators for deep learning on versal acap architecture,

    J. Zhuang, J. Lau, H. Ye, Z. Yang, S. Ji, J. Lo, K. Denolf, S. Neuen- dorffer, A. Jones, J. Huet al., “Charm 2.0: Composing heterogeneous accelerators for deep learning on versal acap architecture,”ACM Trans- actions on Reconfigurable Technology and Systems, vol. 17, no. 3, pp. 1–31, 2024

  41. [41]

    Highly efficient self-checking matrix multiplication on tiled amx accelerators,

    C. S. Mummidi, V . C. Ferreira, S. Srinivasan, and S. Kundu, “Highly efficient self-checking matrix multiplication on tiled amx accelerators,” ACM Transactions on Architecture and Code Optimization, vol. 21, no. 2, pp. 1–22, 2024

  42. [42]

    In-datacenter performance analysis of a tensor processing unit,

    N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borcherset al., “In-datacenter performance analysis of a tensor processing unit,” inProceedings of the 44th annual international symposium on computer architecture, 2017, pp. 1–12

  43. [43]

    Siracusa: A 16 nm heterogenous risc-v soc for extended reality with at-mram neural engine,

    A. S. Prasad, M. Scherer, F. Conti, D. Rossi, A. Di Mauro, M. Eggimann, J. T. G ´omez, Z. Li, S. S. Sarwar, Z. Wanget al., “Siracusa: A 16 nm heterogenous risc-v soc for extended reality with at-mram neural engine,”IEEE Journal of Solid-State Circuits, 2024

  44. [44]

    Autoai2c: An automated hardware generator for dnn acceleration on both fpga and asic,

    Y . Zhang, X. Zhang, P. Xu, Y . Zhao, C. Hao, D. Chen, and Y . Lin, “Autoai2c: An automated hardware generator for dnn acceleration on both fpga and asic,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024

  45. [45]

    Hdreason: Algorithm-hardware codesign for hyperdimensional knowledge graph reasoning,

    H. Chen, Y . Ni, A. Zakeri, Z. Zou, S. Yun, F. Wen, B. Khaleghi, N. Srinivasa, H. Latapie, and M. Imani, “Hdreason: Algorithm-hardware codesign for hyperdimensional knowledge graph reasoning,”arXiv preprint arXiv:2403.05763, 2024

  46. [46]

    S2tar: Shared secure trusted accelerators with reconfiguration for machine learning in the cloud,

    W. Ren, S. Koteshwara, M. Ye, H. Franke, and D. Chen, “S2tar: Shared secure trusted accelerators with reconfiguration for machine learning in the cloud,” in2024 IEEE 17th International Conference on Cloud Computing (CLOUD). IEEE, 2024, pp. 267–278

  47. [47]

    Hardware-assisted virtualization of neural processing units for cloud platforms,

    Y . Xue, Y . Liu, L. Nai, and J. Huang, “Hardware-assisted virtualization of neural processing units for cloud platforms,” in2024 57th IEEE/ACM 16 International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 1–16

  48. [48]

    A unified engine for accelerating gnn weighting/aggregation operations, with efficient load balancing and graph-specific caching,

    S. Mondal, S. D. Manasi, K. Kunal, Z. Zeng, S. S. Sapatnekaret al., “A unified engine for accelerating gnn weighting/aggregation operations, with efficient load balancing and graph-specific caching,”IEEE Trans- actions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 12, pp. 4844–4857, 2022

  49. [49]

    Dynasparse: Accelerating gnn inference through dynamic sparsity exploitation,

    B. Zhang and V . Prasanna, “Dynasparse: Accelerating gnn inference through dynamic sparsity exploitation,” in2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2023, pp. 233–244

  50. [50]

    Me-vit: A single-load memory-efficient fpga accelerator for vision transformers,

    K. Marino, P. Zhang, and V . K. Prasanna, “Me-vit: A single-load memory-efficient fpga accelerator for vision transformers,” in2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC). IEEE, 2023, pp. 213–223

  51. [51]

    Dap: A 507-gmacs/j 256-core domain adaptive processor for wireless communication and linear algebra kernels in 12-nm finfet,

    K.-Y . Chen, C.-S. Yang, Y .-H. Sun, C.-W. Tseng, M. Fayazi, X. He, S. Feng, Y . Yue, T. Mudge, R. Dreslinskiet al., “Dap: A 507-gmacs/j 256-core domain adaptive processor for wireless communication and linear algebra kernels in 12-nm finfet,”IEEE Journal of Solid-State Circuits, 2024

  52. [52]

    Q. Zhang, Z. Fan, H. An, Z. Wang, Z. Li, G. Wang, P. Abillama, H.-S. Kim, D. Blaauw, and D. Sylvester, “Robovisio: A micro-robot vision domain-specific soc for autonomous navigation enabling fully-on-chip intelligence via 2-mb emram,” IEEE Journal of Solid-State Circuits, 2024.

  53. [53]

    J.-F. Zhang, C.-H. Lu, and Z. Zhang, “Tetrix: Flexible architecture and optimal mapping for tensorized neural network processing,” IEEE Transactions on Computers, 2024.

  54. [54]

    F. H. McMahon, “The livermore fortran kernels: A computer test of the numerical performance range,” Lawrence Livermore National Lab., CA (USA), Tech. Rep., 1986.

  55. [55]

    N. P. Jouppi, “Cache write policies and performance,” ACM SIGARCH Computer Architecture News, vol. 21, no. 2, pp. 191–201, 1993.

  56. [56]

    J. Dongarra, C. Moler, J. Bunch, and G. Stewart, LINPACK Users’ Guide. SIAM, 1979.

  57. [57]

    I. Bravo, M. Mazo, J. L. Lázaro, A. Gardel, P. Jiménez, and D. Pizarro, “An intelligent architecture based on field programmable gate arrays designed to detect moving objects by using principal component analysis,” Sensors, vol. 10, no. 10, pp. 9232–9251, 2010.

  58. [58]

    S. Kasap and S. Redif, “Novel field-programmable gate array architecture for computing the eigenvalue decomposition of para-hermitian polynomial matrices,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 522–536, 2013.

  59. [59]

    E. Alpaydin and C. Kaynak, “Optical Recognition of Handwritten Digits,” UCI Machine Learning Repository, 1998, DOI: https://doi.org/10.24432/C50P49.

  60. [60]

    Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

  61. [61]

    A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.

  62. [62]

    “Olivetti faces dataset,” AT&T Laboratories Cambridge (1992–1994), available via scikit-learn, 1994, https://scikit-learn.org/0.19/datasets/olivetti_faces.html.

  63. [63]

    W. Wolberg, O. Mangasarian, N. Street, and W. Street, “Breast Cancer Wisconsin (Diagnostic),” UCI Machine Learning Repository, 1993, DOI: https://doi.org/10.24432/C5DW2B.

  64. [64]

    T. Mitchell, “Twenty Newsgroups,” UCI Machine Learning Repository, 1997, DOI: https://doi.org/10.24432/C5C323.

  65. [65]

    NVIDIA Ampere GA102 GPU Ar- chitecture Whitepaper,

    NVIDIA Corporation, “NVIDIA Ampere GA102 GPU Ar- chitecture Whitepaper,” NVIDIA Corporation, Tech. Rep.,

  66. [66]

    Available: https://www.nvidia.com/content/PDF/ nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Srivaths Ramasubramanian(Student Member, IEEE) received the B.E

    [Online]. Available: https://www.nvidia.com/content/PDF/ nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Srivaths Ramasubramanian(Student Member, IEEE) received the B.E. degree in Electronics and Communication Engineering from Rashtreeya Vidyalaya College of Engineering (RVCE), Ben- galuru, India, in 2025. He is currently pursuing the Ph.D. degree...