Recognition: unknown
MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis
Pith reviewed 2026-05-10 16:06 UTC · model grok-4.3
The pith
MANOJAVAM unifies matrix multiplication and SVD on FPGAs for efficient PCA.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The MANOJAVAM(T,S) architecture unifies matrix multiplication and SVD in a single scalable fabric. It uses S TxT TPU-style systolic arrays with block streaming for matrix multiplication and a highly parallel Jacobian unit with pipelined CORDIC rotations for SVD. A two-tier cache hierarchy and mode-aware memory policies adapt to the memory access patterns of covariance matrix computation and rotation computation. On Xilinx Virtex-Ultrascale+, MANOJAVAM(16,32) runs at 434 MHz consuming 16.957W and achieves up to 22.75x speedup in SVD latency and 42.14x reduction in total energy consumption compared to NVIDIA A6000 GPU on real-world datasets.
What carries the argument
MANOJAVAM(T,S) fabric, which unifies S TxT TPU-style systolic arrays with block streaming for matrix multiplication and a parallel Jacobian SVD unit using pipelined CORDIC rotations, plus two-tier cache hierarchy with mode-aware memory policies.
If this is right
- The accelerator supports PCA on datasets of any input dimension through its scalable T and S parameters.
- It reduces the need for separate hardware for matrix multiplication and SVD stages in PCA pipelines.
- Energy use for SVD operations drops sharply, supporting longer runs in power-limited edge devices.
- High-frequency operation on modern FPGAs enables real-time processing in data analytics.
- The unified fabric serves as a base for energy-efficient large-scale PCA in both high-performance and edge settings.
Where Pith is reading between the lines
- The design could extend to ASIC versions for further gains in speed and efficiency beyond FPGA limits.
- Similar systolic-plus-CORDIC unification might apply to other matrix-heavy tasks such as least-squares solvers.
- Hybrid systems pairing this FPGA fabric with CPUs could handle even larger datasets by offloading only the heavy stages.
- Adjusting cache depths for specific dataset sizes could be tested to push scalability further.
Load-bearing premise
The two-tier cache hierarchy and mode-aware memory policies adapt successfully to the distinct memory access patterns of covariance matrix computation and rotation computation without introducing unaccounted bottlenecks or scalability limits for arbitrary input dimensions.
What would settle it
Benchmarking MANOJAVAM(16,32) on a real-world dataset with larger dimensions than tested, where SVD latency speedup falls below 22.75x or total energy reduction falls below 42.14x relative to the NVIDIA A6000 GPU due to memory stalls.
Figures
read the original abstract
Principal Component Analysis (PCA) is widely used for dimensionality reduction in hyperspectral imaging, genomics, and neurosciences. However, it suffers from computational bottlenecks in matrix multiplication and singular value decomposition (SVD). Prior PCA hardware accelerators either target only one of these stages, rely on High Level Synthesis (HLS) that limits microarchitectural optimizations or use fixed point datapaths with limited dataset scalability. There is a need for a unified PCA accelerator that is suitable for datasets of any input dimension. Hence, the proposed work presents MANOJAVAM, a scalable PCA accelerator fabric, unifying matrix multiplication and SVD in a single architecture. MANOJAVAM(T,S) comprises an S number of TxT TPU-style systolic arrays employing block streaming for high-throughput matrix multiplication. It further integrates a highly parallel Jacobian unit implementing the Jacobi method for SVD with pipelined CORDIC based rotations. A two tier cache hierarchy and mode-aware memory policies adapts to the distinct memory access patterns of covariance matrix and rotation computation. For demonstration, MANOJAVAM(4,8) is realized on a Xilinx Artix-7 FPGA, achieving a frequency of 200 MHz at 1.271W. MANOJAVAM(16,32) is realized on Xilinx Virtex-Ultrascale+ FPGA, achieving a frequency of 434 MHz at 16.957W. Benchmarking on real-world datasets reveals that MANOJAVAM(16,32) achieves up to a 22.75x speedup in SVD latency and a 42.14x reduction in total energy consumption compared to a high-performance NVIDIA A6000 GPU. The architecture offers a unified, scalable, and energy-efficient platform for large-scale data analytics in both high-performance and edge-computing environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MANOJAVAM, a scalable unified FPGA accelerator fabric for PCA that combines block-streaming systolic arrays for matrix multiplication with a parallel Jacobi SVD unit using pipelined CORDIC rotations. It describes a two-tier cache hierarchy with mode-aware memory policies, reports FPGA implementations on Artix-7 (MANOJAVAM(4,8) at 200 MHz, 1.271 W) and Virtex-Ultrascale+ (MANOJAVAM(16,32) at 434 MHz, 16.957 W), and claims up to 22.75x SVD latency speedup and 42.14x total energy reduction versus an NVIDIA A6000 GPU on real-world datasets.
Significance. If the performance and energy claims are supported by fully specified, reproducible baselines and methodology, the work would offer a practical contribution to energy-efficient hardware for PCA in edge and high-performance analytics, providing a single fabric for the two dominant kernels rather than separate accelerators.
major comments (2)
- [Abstract] Abstract: The central claims of 22.75x SVD latency speedup and 42.14x energy reduction for MANOJAVAM(16,32) versus the A6000 GPU are presented without any information on the GPU baseline (library, e.g. cuSOLVER or custom Jacobi, optimization flags, precision), exact matrix dimensions or dataset sizes, latency/power measurement tools and scope (board-level vs. kernel-only, nvidia-smi vs. external meter), or statistical details such as error bars or number of runs. These omissions make the reported factors impossible to verify or reproduce and directly undermine the validity of the performance comparison.
- [Architecture] Architecture description (two-tier cache and mode-aware policies): The claim that the memory hierarchy successfully adapts to the distinct access patterns of covariance-matrix and rotation phases without introducing scalability bottlenecks for arbitrary input dimensions is stated but not supported by any quantitative data (cache hit rates, stall cycles, or ablation results across matrix sizes). This is load-bearing for the scalability assertion yet lacks the concrete evaluation needed to substantiate it.
minor comments (2)
- [Abstract] The abstract refers to 'real-world datasets' without naming them or giving sizes; the main text should explicitly list the datasets, their dimensions, and how they exercise the claimed scalability.
- [Implementation] Power figures are given as single values (1.271 W, 16.957 W) with no indication of measurement conditions or variation; add this detail for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the two major comments point-by-point below and will incorporate the requested clarifications and additional data into the revised manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claims of 22.75x SVD latency speedup and 42.14x energy reduction for MANOJAVAM(16,32) versus the A6000 GPU are presented without any information on the GPU baseline (library, e.g. cuSOLVER or custom Jacobi, optimization flags, precision), exact matrix dimensions or dataset sizes, latency/power measurement tools and scope (board-level vs. kernel-only, nvidia-smi vs. external meter), or statistical details such as error bars or number of runs. These omissions make the reported factors impossible to verify or reproduce and directly undermine the validity of the performance comparison.
Authors: We agree that the abstract and current experimental description omit critical details needed for reproducibility. In the revised manuscript we will expand the abstract and add a dedicated 'Experimental Setup' subsection that explicitly states: (1) the GPU baseline uses the cuSOLVER library (CUDA 11.8) with -O3 flags and double-precision arithmetic for the Jacobi SVD path; (2) the exact matrix dimensions drawn from the real-world datasets (e.g., 2048×2048 covariance matrices for the hyperspectral and genomics benchmarks); (3) latency measured via CUDA events and power via nvidia-smi at the board level, with kernel-only figures also reported; and (4) all results averaged over 10 independent runs with standard deviation and error bars. These additions will make the 22.75× latency and 42.14× energy claims directly verifiable. revision: yes
-
Referee: [Architecture] Architecture description (two-tier cache and mode-aware policies): The claim that the memory hierarchy successfully adapts to the distinct access patterns of covariance-matrix and rotation phases without introducing scalability bottlenecks for arbitrary input dimensions is stated but not supported by any quantitative data (cache hit rates, stall cycles, or ablation results across matrix sizes). This is load-bearing for the scalability assertion yet lacks the concrete evaluation needed to substantiate it.
Authors: The manuscript describes the two-tier cache hierarchy and mode-aware policies in Section IV, but we acknowledge that quantitative evidence (hit rates, stall cycles, ablation across sizes) is not provided. In the revision we will add a new evaluation subsection that reports FPGA performance-counter data from the Virtex-Ultrascale+ implementation for matrix sizes ranging from 512×512 to 4096×4096. These data will include L1/L2 hit rates, memory-stall cycles, and throughput scaling curves under both covariance and rotation modes, thereby substantiating that the policies prevent bottlenecks for the supported range of input dimensions. revision: yes
Circularity Check
No circularity: empirical FPGA implementation with direct measurements
full rationale
The paper presents a hardware architecture for PCA acceleration, its FPGA realization at specific scales (MANOJAVAM(4,8) on Artix-7, MANOJAVAM(16,32) on Virtex-Ultrascale+), and measured results for frequency, power, latency, and energy. No derivation chain, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. Claims rest on synthesis reports and benchmarking rather than self-referential math or load-bearing self-citations. The GPU comparison is an external empirical baseline, not an internal derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- T and S (array size and count in MANOJAVAM(T,S))
axioms (2)
- domain assumption The Jacobi method converges reliably for the SVD computations performed in the parallel unit
- standard math Pipelined CORDIC provides sufficient accuracy and throughput for the required matrix rotations
Reference graph
Works this paper leans on
-
[1]
Design of ion-implanted mosfet’s with very small physical dimensions,
R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V . L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of ion-implanted mosfet’s with very small physical dimensions,”IEEE Journal of solid-state circuits, vol. 9, no. 5, pp. 256–268, 1974
1974
-
[2]
Fifty years of moore’s law,
C. A. Mack, “Fifty years of moore’s law,”IEEE Transactions on semiconductor manufacturing, vol. 24, no. 2, pp. 202–207, 2011
2011
-
[3]
Accelerator-rich architectures: Opportunities and progresses,
J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Rein- man, “Accelerator-rich architectures: Opportunities and progresses,” in Proceedings of the 51st annual design automation conference, 2014, pp. 1–6
2014
-
[4]
Understanding the efficiency of gpu algorithms for matrix-matrix multiplication,
K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of gpu algorithms for matrix-matrix multiplication,” in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference 15 on Graphics Hardware, ser. HWWS ’04. New York, NY , USA: Association for Computing Machinery, 2004, p. 133–137. [Online]. Available: https://doi.org/10.1145/1058129.1058148
-
[5]
A survey on deep learning hardware accelerators for heterogeneous hpc platforms,
C. Silvano, D. Ielmini, F. Ferrandi, L. Fiorin, S. Curzel, L. Benini, F. Conti, A. Garofalo, C. Zambelli, E. Caloreet al., “A survey on deep learning hardware accelerators for heterogeneous hpc platforms,”arXiv preprint arXiv:2306.15552, 2023
-
[6]
Dark silicon and the end of multicore scaling,
H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” inPro- ceedings of the 38th annual international symposium on Computer architecture, 2011, pp. 365–376
2011
-
[7]
A survey on neural network hardware accelerators,
T. Mohaidat and K. Khalil, “A survey on neural network hardware accelerators,”IEEE Transactions on Artificial Intelligence, vol. 5, no. 8, pp. 3801–3822, 2024
2024
-
[8]
Hyperspectral image compression using jpeg2000 and principal component analysis,
Q. Du and J. E. Fowler, “Hyperspectral image compression using jpeg2000 and principal component analysis,”IEEE Geoscience and Remote sensing letters, vol. 4, no. 2, pp. 201–205, 2007
2007
-
[9]
Principal component analysis for hyper- spectral image classification,
C. Rodarmel and J. Shan, “Principal component analysis for hyper- spectral image classification,”Surveying and Land Information Science, vol. 62, no. 2, pp. 115–122, 2002
2002
-
[10]
Hardware accelerators for real-time face recognition: A survey,
A. Baobaid, M. Meribout, V . K. Tiwari, and J. P. Pena, “Hardware accelerators for real-time face recognition: A survey,”IEEE Access, vol. 10, pp. 83 723–83 739, 2022
2022
-
[11]
Dimensionality reduction in neuroscience,
R. Pang, B. J. Lansdell, and A. L. Fairhall, “Dimensionality reduction in neuroscience,”Current Biology, vol. 26, no. 14, pp. R656–R660, 2016
2016
-
[12]
Fast principal component analysis of large- scale genome-wide data,
G. Abraham and M. Inouye, “Fast principal component analysis of large- scale genome-wide data,”PloS one, vol. 9, no. 4, p. e93766, 2014
2014
-
[13]
Benchmarking principal component analysis for large-scale single-cell rna-sequencing,
K. Tsuyuzaki, H. Sato, K. Sato, and I. Nikaido, “Benchmarking principal component analysis for large-scale single-cell rna-sequencing,”Genome biology, vol. 21, no. 1, p. 9, 2020
2020
-
[14]
Principal component analysis,
M. Greenacre, P. J. Groenen, T. Hastie, A. I. d’Enza, A. Markos, and E. Tuzhilina, “Principal component analysis,”Nature Reviews Methods Primers, vol. 2, no. 1, p. 100, 2022
2022
-
[15]
A Tutorial on Principal Component Analysis
J. Shlens, “A tutorial on principal component analysis,”arXiv preprint arXiv:1404.1100, 2014
work page Pith review arXiv 2014
-
[16]
Principal component analysis,
R. Bro and A. K. Smilde, “Principal component analysis,”Analytical methods, vol. 6, no. 9, pp. 2812–2831, 2014
2014
-
[17]
A reconfigurable hardware archi- tecture for principal component analysis,
U. A. Korat and A. Alimohammad, “A reconfigurable hardware archi- tecture for principal component analysis,”Circuits, Systems, and Signal Processing, vol. 38, pp. 2097–2113, 2019
2097
-
[18]
Fpga implemen- tation of the principal component analysis algorithm for dimensionality reduction of hyperspectral images,
D. Fernandez, C. Gonzalez, D. Mozos, and S. Lopez, “Fpga implemen- tation of the principal component analysis algorithm for dimensionality reduction of hyperspectral images,”Journal of Real-Time Image Pro- cessing, vol. 16, pp. 1395–1406, 2019
2019
-
[19]
High level design of a flexible pca hardware accelerator using a new block-streaming method,
M. A. Mansoori and M. R. Casu, “High level design of a flexible pca hardware accelerator using a new block-streaming method,”Electronics, vol. 9, no. 3, p. 449, 2020
2020
-
[20]
An fpga-based network intrusion detection architecture,
A. Das, D. Nguyen, J. Zambreno, G. Memik, and A. Choudhary, “An fpga-based network intrusion detection architecture,”IEEE Transactions on Information Forensics and Security, vol. 3, no. 1, pp. 118–132, 2008
2008
-
[21]
Dynamic partial reconfigurable hardware architecture for principal component analysis on mobile and embedded devices,
S. N. Shahrouzi and D. G. Perera, “Dynamic partial reconfigurable hardware architecture for principal component analysis on mobile and embedded devices,”EURASIP Journal on Embedded Systems, vol. 2017, pp. 1–18, 2017
2017
-
[22]
A streaming pca vlsi chip for neural data compression,
T. Wu, W. Zhao, H. Guo, H. H. Lim, and Z. Yang, “A streaming pca vlsi chip for neural data compression,”IEEE transactions on biomedical circuits and systems, vol. 11, no. 6, pp. 1290–1302, 2017
2017
-
[23]
Preliminary results from a 49-channel neural recording asic with embedded spike compression in 28 nm cmos,
W. Lemaire, E. R. Koleibi, T. Omrani, M. Benhouria, K. Koua, C. Ques- nel, L.-P. Gauthier, J. M ´enard, K. Gagnon, S. Royet al., “Preliminary results from a 49-channel neural recording asic with embedded spike compression in 28 nm cmos,” in2022 20th IEEE Interregional NEWCAS Conference (NEWCAS). IEEE, 2022, pp. 285–289
2022
-
[24]
A principal component neural network-based face recognition system and asic implementation,
C. S. S. Prasanna, N. Sudha, and V . Kamakoti, “A principal component neural network-based face recognition system and asic implementation,” in18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design. IEEE, 2005, pp. 795–798
2005
-
[25]
Fpga-based fully parallel pca-ann for spectrum sensing,
A. Elrharras, S. El Moukhlis, R. Saadane, M. Wahbi, and A. Hamdoun, “Fpga-based fully parallel pca-ann for spectrum sensing,”Computer and Information Science, vol. 8, no. 1, p. 108, 2015
2015
-
[26]
Fpga-based odor classification system using principal component analysis,
T. Tongyoo and Y . Ariyakul, “Fpga-based odor classification system using principal component analysis,” in2018 International Conference on Engineering, Applied Sciences, and Technology (ICEAST). IEEE, 2018, pp. 1–4
2018
-
[27]
Hardware pca for gas identification systems using high level synthesis on the zynq soc,
A. A. S. Ali, A. Amira, F. Bensaali, and M. Benammar, “Hardware pca for gas identification systems using high level synthesis on the zynq soc,” in2013 IEEE 20th International Conference on Electronics, Circuits, and Systems (ICECS). IEEE, 2013, pp. 707–710
2013
-
[28]
Accelerating svd computation on fpgas for dsp systems,
Y . Ma and D. Wang, “Accelerating svd computation on fpgas for dsp systems,” in2016 IEEE 13th International Conference on Signal Processing (ICSP). IEEE, 2016, pp. 487–490
2016
-
[29]
Reconfigurable adaptive singular value decomposition engine design for high-throughput mimo-ofdm systems,
Y .-L. Chen, C.-Z. Zhan, T.-J. Jheng, and A.-Y . Wu, “Reconfigurable adaptive singular value decomposition engine design for high-throughput mimo-ofdm systems,”IEEE transactions on very large scale integration (VLSI) systems, vol. 21, no. 4, pp. 747–760, 2012
2012
-
[30]
An fpga implementation of the hestenes- jacobi algorithm for singular value decomposition,
X. Wang and J. Zambreno, “An fpga implementation of the hestenes- jacobi algorithm for singular value decomposition,” in2014 IEEE International Parallel & Distributed Processing Symposium Workshops. IEEE, 2014, pp. 220–227
2014
-
[31]
Fast implementation for the singular value and eigenvalue decomposition based on fpga,
S. Zhang, X. Tian, C. Xiong, J. Tian, and D. Ming, “Fast implementation for the singular value and eigenvalue decomposition based on fpga,” Chinese Journal of Electronics, vol. 26, no. 1, pp. 132–136, 2017
2017
-
[32]
Real-time signal processing of massive sensor arrays via a parallel fast converging svd algorithm: Latency, throughput, and resource analysis,
M. V . Athi, S. R. Zekavat, and A. A. Struthers, “Real-time signal processing of massive sensor arrays via a parallel fast converging svd algorithm: Latency, throughput, and resource analysis,”IEEE Sensors Journal, vol. 16, no. 8, pp. 2519–2526, 2016
2016
-
[33]
Fpga, gpu, and cpu implementations of jacobi algorithm for eigenanalysis,
M. U. Torun, O. Yilmaz, and A. N. Akansu, “Fpga, gpu, and cpu implementations of jacobi algorithm for eigenanalysis,”Journal of Parallel and Distributed Computing, vol. 96, pp. 172–180, 2016
2016
-
[34]
The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays,
R. P. Brent and F. T. Luk, “The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays,”SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 1, pp. 69–84, 1985. [Online]. Available: https://doi.org/10.1137/0906007
-
[35]
A survey of cordic algorithms for fpga based computers,
R. Andraka, “A survey of cordic algorithms for fpga based computers,” inProceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, ser. FPGA ’98. New York, NY , USA: Association for Computing Machinery, 1998, p. 191–200. [Online]. Available: https://doi.org/10.1145/275107.275139
-
[36]
Calculating the singular values and pseudo- inverse of a matrix,
G. Golub and W. Kahan, “Calculating the singular values and pseudo- inverse of a matrix,”Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, vol. 2, no. 2, pp. 205–224, 1965
1965
-
[37]
Music-lite: Efficient music using approximate computing: An ofdm radar case study,
R. Bhattacharjya, A. Sarkar, B. Maity, and N. Dutt, “Music-lite: Efficient music using approximate computing: An ofdm radar case study,”IEEE Embedded Systems Letters, vol. 16, no. 4, pp. 329–332, 2024
2024
-
[38]
The cordic trigonometric computing technique,
J. E. V older, “The cordic trigonometric computing technique,”IRE Transactions on electronic computers, no. 3, pp. 330–334, 1959
1959
-
[39]
50 years of cordic: Algorithms, architectures, and applications,
P. K. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, “50 years of cordic: Algorithms, architectures, and applications,”IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 9, pp. 1893–1907, 2009
1907
-
[40]
Charm 2.0: Composing heterogeneous accelerators for deep learning on versal acap architecture,
J. Zhuang, J. Lau, H. Ye, Z. Yang, S. Ji, J. Lo, K. Denolf, S. Neuen- dorffer, A. Jones, J. Huet al., “Charm 2.0: Composing heterogeneous accelerators for deep learning on versal acap architecture,”ACM Trans- actions on Reconfigurable Technology and Systems, vol. 17, no. 3, pp. 1–31, 2024
2024
-
[41]
Highly efficient self-checking matrix multiplication on tiled amx accelerators,
C. S. Mummidi, V . C. Ferreira, S. Srinivasan, and S. Kundu, “Highly efficient self-checking matrix multiplication on tiled amx accelerators,” ACM Transactions on Architecture and Code Optimization, vol. 21, no. 2, pp. 1–22, 2024
2024
-
[42]
In-datacenter performance analysis of a tensor processing unit,
N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borcherset al., “In-datacenter performance analysis of a tensor processing unit,” inProceedings of the 44th annual international symposium on computer architecture, 2017, pp. 1–12
2017
-
[43]
Siracusa: A 16 nm heterogenous risc-v soc for extended reality with at-mram neural engine,
A. S. Prasad, M. Scherer, F. Conti, D. Rossi, A. Di Mauro, M. Eggimann, J. T. G ´omez, Z. Li, S. S. Sarwar, Z. Wanget al., “Siracusa: A 16 nm heterogenous risc-v soc for extended reality with at-mram neural engine,”IEEE Journal of Solid-State Circuits, 2024
2024
-
[44]
Autoai2c: An automated hardware generator for dnn acceleration on both fpga and asic,
Y . Zhang, X. Zhang, P. Xu, Y . Zhao, C. Hao, D. Chen, and Y . Lin, “Autoai2c: An automated hardware generator for dnn acceleration on both fpga and asic,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024
2024
-
[45]
Hdreason: Algorithm-hardware codesign for hyperdimensional knowledge graph reasoning,
H. Chen, Y . Ni, A. Zakeri, Z. Zou, S. Yun, F. Wen, B. Khaleghi, N. Srinivasa, H. Latapie, and M. Imani, “Hdreason: Algorithm-hardware codesign for hyperdimensional knowledge graph reasoning,”arXiv preprint arXiv:2403.05763, 2024
-
[46]
S2tar: Shared secure trusted accelerators with reconfiguration for machine learning in the cloud,
W. Ren, S. Koteshwara, M. Ye, H. Franke, and D. Chen, “S2tar: Shared secure trusted accelerators with reconfiguration for machine learning in the cloud,” in2024 IEEE 17th International Conference on Cloud Computing (CLOUD). IEEE, 2024, pp. 267–278
2024
-
[47]
Hardware-assisted virtualization of neural processing units for cloud platforms,
Y . Xue, Y . Liu, L. Nai, and J. Huang, “Hardware-assisted virtualization of neural processing units for cloud platforms,” in2024 57th IEEE/ACM 16 International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 1–16
2024
-
[48]
A unified engine for accelerating gnn weighting/aggregation operations, with efficient load balancing and graph-specific caching,
S. Mondal, S. D. Manasi, K. Kunal, Z. Zeng, S. S. Sapatnekaret al., “A unified engine for accelerating gnn weighting/aggregation operations, with efficient load balancing and graph-specific caching,”IEEE Trans- actions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 12, pp. 4844–4857, 2022
2022
-
[49]
Dynasparse: Accelerating gnn inference through dynamic sparsity exploitation,
B. Zhang and V . Prasanna, “Dynasparse: Accelerating gnn inference through dynamic sparsity exploitation,” in2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2023, pp. 233–244
2023
-
[50]
Me-vit: A single-load memory-efficient fpga accelerator for vision transformers,
K. Marino, P. Zhang, and V . K. Prasanna, “Me-vit: A single-load memory-efficient fpga accelerator for vision transformers,” in2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC). IEEE, 2023, pp. 213–223
2023
-
[51]
Dap: A 507-gmacs/j 256-core domain adaptive processor for wireless communication and linear algebra kernels in 12-nm finfet,
K.-Y . Chen, C.-S. Yang, Y .-H. Sun, C.-W. Tseng, M. Fayazi, X. He, S. Feng, Y . Yue, T. Mudge, R. Dreslinskiet al., “Dap: A 507-gmacs/j 256-core domain adaptive processor for wireless communication and linear algebra kernels in 12-nm finfet,”IEEE Journal of Solid-State Circuits, 2024
2024
-
[52]
Robovisio: A micro-robot vision domain-specific soc for autonomous navigation enabling fully-on-chip intelligence via 2-mb emram,
Q. Zhang, Z. Fan, H. An, Z. Wang, Z. Li, G. Wang, P. Abillama, H.-S. Kim, D. Blaauw, and D. Sylvester, “Robovisio: A micro-robot vision domain-specific soc for autonomous navigation enabling fully-on-chip intelligence via 2-mb emram,”IEEE Journal of Solid-State Circuits, 2024
2024
-
[53]
Tetrix: Flexible architecture and optimal mapping for tensorized neural network processing,
J.-F. Zhang, C.-H. Lu, and Z. Zhang, “Tetrix: Flexible architecture and optimal mapping for tensorized neural network processing,”IEEE Transactions on Computers, 2024
2024
-
[54]
The livermore fortran kernels: A computer test of the numerical performance range,
F. H. McMahon, “The livermore fortran kernels: A computer test of the numerical performance range,” Lawrence Livermore National Lab., CA (USA), Tech. Rep., 1986
1986
-
[55]
Cache write policies and performance,
N. P. Jouppi, “Cache write policies and performance,”ACM SIGARCH Computer Architecture News, vol. 21, no. 2, pp. 191–201, 1993
1993
-
[56]
Bunch, and gw stewart,
J. Dongarra, C. Moler, and R. James, “Bunch, and gw stewart,”LIN- PACK users’ guide. SIAM, 1979
1979
-
[57]
An intelligent architecture based on field programmable gate arrays de- signed to detect moving objects by using principal component analysis,
I. Bravo, M. Mazo, J. L. L ´azaro, A. Gardel, P. Jim ´enez, and D. Pizarro, “An intelligent architecture based on field programmable gate arrays de- signed to detect moving objects by using principal component analysis,” Sensors, vol. 10, no. 10, pp. 9232–9251, 2010
2010
-
[58]
Novel field-programmable gate array archi- tecture for computing the eigenvalue decomposition of para-hermitian polynomial matrices,
S. Kasap and S. Redif, “Novel field-programmable gate array archi- tecture for computing the eigenvalue decomposition of para-hermitian polynomial matrices,”IEEE Transactions on Very Large Scale Integra- tion (VLSI) Systems, vol. 22, no. 3, pp. 522–536, 2013
2013
-
[59]
Optical Recognition of Hand- written Digits,
E. Alpaydin and C. Kaynak, “Optical Recognition of Hand- written Digits,” UCI Machine Learning Repository, 1998, DOI: https://doi.org/10.24432/C50P49
-
[60]
Gradient-based learning applied to document recognition,
Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998
1998
-
[61]
Learning multiple layers of features from tiny images,
A. Krizhevsky, G. Hintonet al., “Learning multiple layers of features from tiny images,” 2009
2009
-
[62]
Olivetti faces dataset,
“Olivetti faces dataset,” AT&T Laboratories Cambridge (1992–1994), available via scikit-learn, 1994, https://scikit-learn.org/0.19/datasets/ olivetti faces.html
1992
-
[63]
Street.Breast Cancer Wisconsin (Diagnostic)
W. Wolberg, O. Mangasarian, N. Street, and W. Street, “Breast Cancer Wisconsin (Diagnostic),” UCI Machine Learning Repository, 1993, DOI: https://doi.org/10.24432/C5DW2B
-
[64]
T. Mitchell, “Twenty Newsgroups,” UCI Machine Learning Repository, 1997, DOI: https://doi.org/10.24432/C5C323
-
[65]
NVIDIA Ampere GA102 GPU Ar- chitecture Whitepaper,
NVIDIA Corporation, “NVIDIA Ampere GA102 GPU Ar- chitecture Whitepaper,” NVIDIA Corporation, Tech. Rep.,
-
[66]
Available: https://www.nvidia.com/content/PDF/ nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Srivaths Ramasubramanian(Student Member, IEEE) received the B.E
[Online]. Available: https://www.nvidia.com/content/PDF/ nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Srivaths Ramasubramanian(Student Member, IEEE) received the B.E. degree in Electronics and Communication Engineering from Rashtreeya Vidyalaya College of Engineering (RVCE), Ben- galuru, India, in 2025. He is currently pursuing the Ph.D. degree...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.