Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

Blaise Tine; Nikhil Rout

arXiv: 2512.00053 · v2 · submitted 2025-11-19 · cs.AR


Pith reviewed 2026-05-17 20:32 UTC · model grok-4.3

classification: cs.AR

keywords: fused dot product · mixed-precision arithmetic · tensor core · GPGPU · FPGA implementation · RISC-V · matrix multiply-accumulate · open-source hardware


The pith

Ten-Four fuses floating-point and integer pipelines into one dot-product unit that runs mixed-precision matrix operations in four cycles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ten-Four as a scalable mixed-precision fused dot product unit built for the open-source Vortex GPGPU Tensor Core. It combines floating-point and integer arithmetic paths to handle multiplications in FP16, BF16, FP8, BF8, INT8, and INT4 formats while accumulating results in FP32 or INT32. The design adds native Microscaling (MX) support and sparse lane clock-gating for dynamic power savings. On an AMD Xilinx Alveo U55C FPGA it reaches 4-cycle latency at 262.325 MHz, yielding 134.308 GFLOPS peak per Tensor Core and a roughly 3.1 times speedup over an equivalent Berkeley HardFloat version at under 60 percent of the area, while matching NVIDIA Tensor Core numerical accuracy. This matters for open-source GPGPU development because, according to the paper, existing open-source dot product implementations rely on discrete arithmetic units, which brings high latency, accumulated rounding error, and poor resource utilization.

Core claim

Ten-Four integrates both the floating-point and integer arithmetic pipelines within a single fused architecture that supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native Microscaling and sparse lane clock-gating, achieving 4-cycle operation latency at 262.325 MHz Fmax and 134.308 GFLOPS peak throughput per Tensor Core on the AMD Xilinx Alveo U55C FPGA while delivering approximately 3.1 times the performance of an equivalent Berkeley HardFloat-based implementation at less than 60 percent the area cost and matching NVIDIA Tensor Core numerical accuracy.
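The headline figures can be cross-checked with simple arithmetic. The sketch below derives the per-cycle work implied by the reported peak throughput and the wall-clock latency implied by the reported cycle count. It assumes that peak throughput equals clock rate times operations issued per cycle and that one multiply-accumulate counts as two floating-point operations; both conventions are assumptions made here, not statements taken from the paper, and the variable names are illustrative.

```python
# Reported figures from the abstract.
FMAX_HZ = 262.325e6       # Fmax on the Alveo U55C
PEAK_FLOPS = 134.308e9    # peak throughput per Tensor Core
LATENCY_CYCLES = 4        # operation latency in cycles

# Assumption: peak FLOPS = Fmax * FLOPs issued per cycle.
flops_per_cycle = PEAK_FLOPS / FMAX_HZ

# Assumption: one multiply-accumulate (MAC) is counted as two FLOPs.
macs_per_cycle = flops_per_cycle / 2

# Latency in nanoseconds for a 4-cycle operation at Fmax.
latency_ns = LATENCY_CYCLES / FMAX_HZ * 1e9

print(f"{flops_per_cycle:.1f} FLOPs per cycle")
print(f"{macs_per_cycle:.1f} MACs per cycle")
print(f"{latency_ns:.2f} ns operation latency")
```

Under these assumptions the reported numbers are consistent with about 512 FLOPs (256 multiply-accumulates) per cycle per Tensor Core and an operation latency of a little over 15 ns.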

What carries the argument

A single fused dot-product architecture that merges floating-point and integer pipelines to perform multiplication and accumulation without intermediate rounding or separate units.
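To make the effect of removing intermediate rounding concrete, here is a minimal software sketch. It is not the Ten-Four datapath: `discrete_dot` and `fused_dot` are hypothetical models written for this illustration, FP32 rounding is emulated with Python's `struct` module, and the fused model rounds an exactly computed double-precision sum to FP32, which can in rare cases differ from a true single rounding.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def discrete_dot(a, b, acc=0.0):
    """Model of discrete units: round after every multiply and every add."""
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def fused_dot(a, b, acc=0.0):
    """Model of a fused unit: sum all products exactly, then round once."""
    exact = math.fsum([acc] + [x * y for x, y in zip(a, b)])
    return to_f32(exact)

# All operands are exactly representable in FP16.
a = [4096.0, 1.0, -4096.0]
b = [4096.0, 1.0, 4096.0]

discrete_result = discrete_dot(a, b)
fused_result = fused_dot(a, b)
print(discrete_result, fused_result)
```

On this input the discrete model returns 0.0, because 2^24 + 1 is not representable in FP32 and the small term is lost before the large terms cancel, while the fused model returns 1.0. Real hardware typically keeps a finite-width internal accumulator rather than an exact sum, so the size of the effect depends on the design.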

If this is right

Matrix-multiply-accumulate operations inside open-source GPGPUs can now complete in four cycles instead of the higher latency of discrete units.
Resource utilization improves because a single pipeline replaces multiple separate arithmetic blocks.
Dynamic power drops further through built-in sparse lane clock-gating when many lanes are inactive.
Designers gain an open-source drop-in unit that already matches commercial Tensor Core accuracy for mixed-precision workloads.
The same fused structure scales to additional low-precision formats without redesigning separate adders or multipliers.
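The material here does not say how Ten-Four decides that a lane is inactive. As a rough illustration of the idea only, the sketch below assumes a hypothetical policy in which an integer lane is gated whenever either of its operands is zero, since that lane's product cannot change the sum. The function names and the policy are assumptions made for this example.

```python
def lane_enables(a, b):
    """Hypothetical policy: a lane is clocked only if both operands are nonzero."""
    return [x != 0 and y != 0 for x, y in zip(a, b)]

def gated_dot(a, b, acc=0):
    """Accumulate only the enabled lanes; return the sum and the gated-lane count."""
    enables = lane_enables(a, b)
    total = acc + sum(x * y for x, y, en in zip(a, b, enables) if en)
    return total, enables.count(False)

# Eight INT8-style lanes with several zero operands.
a = [3, 0, -2, 5, 0, 0, 7, 1]
b = [4, 9, 0, -1, 0, 6, 2, 0]

result, gated = gated_dot(a, b)
reference = sum(x * y for x, y in zip(a, b))
print(result, reference, gated)
```

For integer operands the gated and ungated sums agree, and here five of the eight lanes could be switched off. The same shortcut is not automatically safe for floating-point lanes, because zero times infinity or NaN yields NaN, so a real design would have to treat non-finite operands separately.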

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other open-source GPU projects could adopt the same fused pipeline to reduce their own Tensor Core area and latency budgets.
Real silicon measurements on a fabricated chip rather than FPGA emulation would reveal whether clock frequency or power numbers shift under sustained AI workloads.
The Microscaling support already present could be extended to newer formats such as FP4 or FP6 once the base unit is verified.
Integration with higher-level compilers would let software teams automatically choose the fused unit for any matrix operation that matches the supported precisions.

Load-bearing premise

The fused pipeline produces exactly the same numerical results as separate discrete units for every supported format and every input pattern that arises inside the full Vortex Tensor Core.

What would settle it

A side-by-side numerical comparison of Ten-Four outputs against a reference discrete-unit implementation for thousands of random and corner-case inputs across all six multiplication formats, or a full integration test inside the Vortex Tensor Core that shows any deviation in accumulated results.
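A comparison of this kind could be organized along the following lines. This is a sketch under stated assumptions, not the authors' verification flow: the two models are software stand-ins for a reference and a device under test, FP16 inputs and FP32 results are emulated with Python's `struct` module, NaN results are canonicalized so that payload differences do not count as mismatches, and all names are hypothetical.

```python
import math
import random
import struct

# FP16 corner cases: +0, -0, smallest and largest subnormal, smallest normal,
# largest finite value, +inf, -inf, and a quiet NaN.
CORNER_BITS = [0x0000, 0x8000, 0x0001, 0x03FF, 0x0400, 0x7BFF, 0x7C00, 0xFC00, 0x7E00]

def f16(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def to_f32(x: float) -> float:
    """Round a Python float to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_bits(x: float) -> int:
    """Return the binary32 bit pattern, with every NaN mapped to one canonical code."""
    if math.isnan(x):
        return 0x7FC00000
    return struct.unpack("<I", struct.pack("<f", x))[0]

def discrete_model(a, b):
    """Stand-in reference: round after every multiply and every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def fused_model(a, b):
    """Stand-in device under test: exact sum of products, rounded once."""
    products = [x * y for x, y in zip(a, b)]
    if all(math.isfinite(p) for p in products):
        return to_f32(math.fsum(products))
    return to_f32(sum(products))  # inf and NaN propagate through plain addition

def make_vectors(n_random, length=4, seed=0):
    """Build corner-case pairs in lane 0 plus random FP16 bit patterns."""
    rng = random.Random(seed)
    vectors = []
    for pa in CORNER_BITS:
        for pb in CORNER_BITS:
            vectors.append(([f16(pa)] + [1.0] * (length - 1),
                            [f16(pb)] + [1.0] * (length - 1)))
    for _ in range(n_random):
        vectors.append(([f16(rng.getrandbits(16)) for _ in range(length)],
                        [f16(rng.getrandbits(16)) for _ in range(length)]))
    return vectors

def compare(ref, dut, vectors):
    """Return every input pair on which the two models differ bit for bit."""
    return [(a, b) for a, b in vectors if f32_bits(ref(a, b)) != f32_bits(dut(a, b))]

vectors = make_vectors(n_random=2000)
# A hand-picked cancellation case that separates fused from discrete rounding.
vectors.append(([4096.0, 1.0, -4096.0, 0.0], [4096.0, 1.0, 4096.0, 0.0]))

self_check = compare(discrete_model, discrete_model, vectors)
mismatches = compare(discrete_model, fused_model, vectors)
print(len(vectors), len(self_check), len(mismatches))
```

The harness only reports where two implementations diverge. Which reference is appropriate (a discrete-unit model, a published Tensor Core result, or a higher-precision golden model) is a choice for the verifier, and a fused unit that rounds once need not agree bit for bit with a model that rounds after every operation.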

Figures

Figures reproduced from arXiv: 2512.00053 by Blaise Tine, Nikhil Rout.

**Figure 2.** FEDP Backends Performance Scaling (FP16/BF16)

Original abstract

Efficient mixed-precision matrix multiply accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic units, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a scalable mixed-precision fused dot product unit that integrates both the floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native support for Microscaling (MX) and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core's numerical accuracy. Ten-Four achieves 4-cycle operation latency at 262.325 MHz Fmax, delivering 134.308 GFLOPS peak throughput per Tensor Core on the AMD Xilinx Alveo U55C FPGA, demonstrating ~3.1x performance improvement over an equivalent Berkeley HardFloat-based implementation at less than 60% the area cost.
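Microscaling lends itself to a short worked example. The sketch below assumes the OCP MX convention in which a block of elements shares one power-of-two scale stored as an E8M0 code with bias 127; how Ten-Four applies the scale internally is not described in the material here. The function names are hypothetical, and the four-element blocks are shortened for readability (the MX formats define 32-element blocks).

```python
import math

E8M0_BIAS = 127  # assumed convention: shared scale = 2 ** (code - 127)

def mx_scale(code: int) -> float:
    """Decode an E8M0 scale code into its power-of-two value."""
    return math.ldexp(1.0, code - E8M0_BIAS)

def mx_dot(scale_a, elems_a, scale_b, elems_b):
    """Dot product of two MX blocks with the scales applied once, after the sum."""
    partial = math.fsum(x * y for x, y in zip(elems_a, elems_b))
    return math.ldexp(partial, (scale_a - E8M0_BIAS) + (scale_b - E8M0_BIAS))

def dequantized_dot(scale_a, elems_a, scale_b, elems_b):
    """Reference: scale every element first, then take the dot product."""
    a = [x * mx_scale(scale_a) for x in elems_a]
    b = [y * mx_scale(scale_b) for y in elems_b]
    return math.fsum(x * y for x, y in zip(a, b))

# Block A has scale 2**2, block B has scale 2**-1.
scale_a, elems_a = 129, [1.0, 0.5, -1.5, 2.0]
scale_b, elems_b = 126, [2.0, 4.0, 1.0, 0.5]

fast = mx_dot(scale_a, elems_a, scale_b, elems_b)
slow = dequantized_dot(scale_a, elems_a, scale_b, elems_b)
print(fast, slow)
```

Because the scales are powers of two, they factor out of the sum, so the two routes agree (both give 7.0 here) and the scale can be applied as a single exponent adjustment on the accumulated result. The reserved NaN scale code is left out of this sketch.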

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This paper ships a concrete fused mixed-precision dot-product unit integrated into the open Vortex GPGPU, with usable FPGA numbers, but the numerical equivalence checks for corner cases look thin.


The authors combined the FP and INT pipelines into one architecture that handles FP16, BF16, FP8, BF8, INT8, and INT4 multiplies with FP32 or INT32 accumulation, plus built-in MX scaling and sparse lane clock-gating for power savings. On the Alveo U55C they report 4-cycle latency at 262 MHz, 134 GFLOPS per Tensor Core, and roughly 3.1x better performance than a HardFloat baseline at under 60% of the area. That is the main deliverable: a working, open implementation rather than a new algorithm or theoretical bound.

The integration into an existing open-source GPU project is the part that makes it more than a standalone RTL block. Reporting real post-synthesis frequency, throughput, and area on a named FPGA gives readers something they can actually try or compare against. The design choices around fusion and dynamic gating are practical for FPGA targets where resources and power matter.

The soft spot is the accuracy claim. The abstract says the fused unit matches NVIDIA Tensor Core numerical accuracy across the supported formats. The stress-test note flags that, without test vectors or a stated methodology for denormals, NaNs, or accumulation overflow, it is not clear whether fusion introduced any hidden rounding differences. If the full manuscript has a verification section that covers those cases with direct comparisons, the concern disappears. If not, reviewers will want that evidence added.

This is aimed at people building or extending open GPGPUs and custom tensor cores on FPGA. A reader who needs reusable RTL ideas or concrete implementation measurements will find value here. The work shows clear engineering thinking and an honest focus on open-source constraints, so it deserves a serious referee even if the verification details need tightening.
I would send it to peer review.

Referee Report

1 major / 2 minor

Summary. The manuscript presents Ten-Four, a scalable open-source fused dot-product unit for mixed-precision MMA operations integrated into the Vortex RISC-V GPGPU Tensor Core. It fuses FP and INT pipelines to support multiplication in FP16/BF16/FP8/BF8/INT8/INT4 with accumulation in FP32/INT32, adds native Microscaling (MX) support and sparse lane clock-gating, and reports 4-cycle latency at 262.325 MHz Fmax on the AMD Xilinx Alveo U55C, delivering 134.308 GFLOPS peak per Tensor Core with a ~3.1× performance improvement at less than 60% of the area relative to a Berkeley HardFloat baseline, while claiming numerical accuracy that matches NVIDIA Tensor Cores.

Significance. If the reported FPGA measurements and numerical equivalence hold, the work supplies a concrete, reproducible open-source building block for low-precision tensor operations on an open GPGPU platform. The fused architecture and concrete post-synthesis numbers (frequency, latency, throughput, area) constitute a useful reference point for the community working on hardware accelerators for deep learning.

major comments (1)

§5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.

minor comments (2)

Table 2 (resource utilization): clarify whether the reported LUT/FF/DSP counts include or exclude the MX scaling logic and sparse-gating circuitry.
Figure 4 (pipeline diagram): the boundary between the fused FP and INT paths is not labeled with cycle-accurate stage boundaries, making it difficult to verify the stated 4-cycle latency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment on the verification aspects of our work. We address the major comment point by point below and will strengthen the manuscript accordingly.


Referee: §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.

Authors: We agree that the current manuscript does not provide explicit test-vector suites, coverage metrics, or side-by-side tables for the corner cases in FP8/BF8/INT4. While our internal verification process included targeted test vectors for denormals, NaNs, and accumulation overflow (generated both randomly and from known edge-case patterns) and confirmed bit-identical behavior against separate Berkeley HardFloat units as well as matching NVIDIA Tensor Core results where defined, these details were omitted due to page limits. In the revised manuscript we will expand the verification subsection in §5 to include: (1) a description of the test-vector generation methodology, (2) coverage metrics for the relevant IEEE 754 and MX corner cases, and (3) concise side-by-side comparison tables for representative denormal, NaN, and overflow scenarios. This addition will make the numerical-equivalence claims fully reproducible without altering the reported results.

Revision: yes

Circularity Check

0 steps flagged

No circularity: performance metrics are direct FPGA synthesis results


The paper reports an FPGA implementation of a fused dot-product unit with measured outcomes (4-cycle latency at 262.325 MHz, 134.308 GFLOPS, ~3.1x speedup, <60% area) obtained from synthesis and timing analysis on the Alveo U55C. These are empirical hardware results rather than predictions or derivations that reduce to fitted parameters or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that loop back to the inputs by construction. The numerical-accuracy claim is presented as a design goal matching NVIDIA Tensor Cores but is not used as a load-bearing derivation step within the paper itself. The contribution is therefore self-contained as an implementation artifact.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard digital design assumptions and re-uses existing open-source arithmetic libraries rather than introducing new mathematical axioms or fitted constants.

axioms (1)

[standard math] Standard assumptions of synchronous digital design, FPGA synthesis tools, and IEEE floating-point rounding modes hold for the target platform. Invoked implicitly when claiming 4-cycle latency and numerical accuracy matching NVIDIA Tensor Cores.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We propose a configurable 4-stage fused dot product architecture supporting low-precision (FP16/BF16/FP8/BF8) multiplication with FP32 accumulation... MOD-4 CSA accumulator structure."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1] Krste Asanović, Rimas Avizienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Daniel Dabbelt, John Hauser, Adam Izraelevitz, Sagar Karandikar, Ben Keller, Donggyu Kim, John Koenig, Yunsup Lee, Eric Love, Martin Maas, Albert Magyar, Howard Mao, Miquel Moreto, Albert Ou, David A. Patterson, Brian Richards, Colin Schmidt, Stephen Twigg, Huy Vo, and Andrew Waterman. 2016. ...
[2] Luca Bertaccini, Gianna Paulin, Matheus Cavalcante, Tim Fischer, Stefan Mach, and Luca Benini. 2024. MiniFloats on RISC-V Cores: ISA Extensions With Mixed-Precision Short Dot Products. IEEE Transactions on Emerging Topics in Computing 12, 4 (2024), 1040–1055. doi:10.1109/TETC.2024.3365354
[3] Tom M. Bruintjes, Karel H. G. Walters, Sabih H. Gerez, Bert Molenkamp, and Gerard J. M. Smit. 2012. Sabrewing: A lightweight architecture for combined floating-point and integer arithmetic. ACM Trans. Archit. Code Optim. 8, 4, Article 41 (Jan. 2012), 22 pages. doi:10.1145/2086696.2086720
[4] Stef Cuyckens, Xiaoling Yi, Nitish Satya Murthy, Chao Fang, and Marian Verhelst.
[5] Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning. arXiv:2505.22404 [cs.AR] https://arxiv.org/abs/2505.22404
[6] Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt, Samuel Steffl, John Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, and Yakun Sophia Shao. 2021. Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Fu...
[7] John R. Hauser. 2019. Berkeley HardFloat Floating-Point Arithmetic Package, Release 1. https://www.jhauser.us/arithmetic/HardFloat.html. Accessed: September 5, 2025.
[8] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826 [cs.DC] https://arxiv.org/abs/1804.06826
[9] Hansung Kim, Ruohan Richard Yan, Joshua You, Tieliang Vamber Yang, and Yakun Sophia Shao. 2025. Virgo: Cluster-level Matrix Unit Integration in GPUs for Scalability and Energy Efficiency. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Rotterdam, Netherlands) (ASPLOS ... doi:10.1145/3676641.3716281
[10] Qiong Li, Chao Fang, and Zhongfeng Wang. 2023. PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications. In 2023 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, USA, 1–5. doi:10.1109/ISCAS46773.2023.10182007
[11] Stefan Mach, Fabian Schuiki, Florian Zaruba, and Luca Benini. 2020. Fpnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29, 4 (2020), 774–787.
[12] Abubakr Nada, Giuseppe Maria Sarda, and Erwan Lenormand. 2025. Cooperative Warp Execution in Tensor Core for RISC-V GPGPU. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). 1422–1436. doi:10.1109/HPCA61900.2025.00107
[13] NVIDIA Corporation. 2017. NVIDIA Tesla V100 GPU Architecture. Technical Report. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[14] Md Aamir Raihan, Negar Goli, and Tor M. Aamodt. 2019. Modeling Deep Learning Accelerator Enabled GPUs. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 79–92. doi:10.1109/ISPASS.2019.00016
[15] Jongwook Sohn and Earl E. Swartzlander. 2016. A Fused Floating-Point Four-Term Dot Product Unit. IEEE Transactions on Circuits and Systems I: Regular Papers 63, 3 (2016), 370–378. doi:10.1109/TCSI.2016.2525042
[16] Blaise Tine and Nikhil Rout. 2025. Vortex GPGPU Tensor Core Unit Extension FEDP DRL RTL Backend. https://github.com/vortexgpgpu/vortex/tree/bug_fixes/hw/rtl/tcu/drl
[17] Blaise Tine, Krishna Praveen Yalamarthy, Fares Elsabbagh, and Hyesoon Kim.
[18] Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO ’21). Association for Computing Machinery, New York, NY, USA, 754–766. doi:10.1145/3466752.3480128
[19] Hao Zhang, Dongdong Chen, and Seok-Bum Ko. 2019. Efficient Multiple-Precision Floating-Point Fused Multiply-Add with Mixed-Precision Support. IEEE Trans. Comput. 68, 7 (2019), 1035–1048. doi:10.1109/TC.2019.2895031
