Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores
Pith reviewed 2026-05-17 20:32 UTC · model grok-4.3
The pith
Ten-Four fuses floating-point and integer pipelines into one dot-product unit that runs mixed-precision matrix operations in four cycles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ten-Four integrates the floating-point and integer arithmetic pipelines in a single fused architecture. It supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 and higher-precision accumulation in FP32/INT32, with native Microscaling (MX) support and sparse lane clock-gating. On the AMD Xilinx Alveo U55C FPGA it achieves 4-cycle operation latency at 262.325 MHz Fmax and 134.308 GFLOPS peak throughput per Tensor Core, approximately 3.1 times the performance of an equivalent Berkeley HardFloat-based implementation at less than 60 percent of the area cost, while matching NVIDIA Tensor Core numerical accuracy.
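As a quick sanity check on the headline numbers, the reported peak throughput and Fmax can be divided to see how much work each clock cycle would have to carry. The sketch below uses only the figures quoted in the core claim. The reading of the result as about 256 multiply-accumulates per cycle assumes the common convention of counting one multiply-accumulate as two FLOPs, which the source does not state.

```python
# Figures quoted in the core claim.
FMAX_HZ = 262.325e6       # reported Fmax
PEAK_FLOPS = 134.308e9    # reported peak throughput per Tensor Core
LATENCY_CYCLES = 4        # reported operation latency

# Work per clock cycle implied by the two reported numbers.
flops_per_cycle = PEAK_FLOPS / FMAX_HZ

# Wall-clock latency of one operation when running at Fmax.
latency_ns = LATENCY_CYCLES / FMAX_HZ * 1e9

print(f"Implied FLOPs per cycle per Tensor Core: {flops_per_cycle:.2f}")
print(f"Latency of one operation at Fmax: {latency_ns:.2f} ns")
```

The quotient comes out just under 512 FLOPs per cycle, so the throughput and frequency figures are mutually consistent with a power-of-two lane configuration, although the exact lane count is not given in the text above.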
What carries the argument
A single fused dot-product architecture that merges floating-point and integer pipelines to perform multiplication and accumulation without intermediate rounding or separate units.
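To make the effect of removing intermediate rounding concrete, here is a minimal software model, not the Ten-Four RTL. It compares a discrete chain that rounds to FP32 after every multiply and every add against an idealised fused model that keeps the exact sum and rounds once. The function names are hypothetical, and the assumption that the fused path is exact before the final rounding is a simplification: real hardware typically aligns and truncates internally.

```python
import struct
from fractions import Fraction

def round_fp32(x):
    """Round a number to the nearest FP32 value, returned as a Python float.
    The value passes through a double first, which is adequate for this sketch."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

def dot_discrete(a, b, c=0.0):
    """Discrete model: every product and every partial sum is rounded to FP32."""
    acc = round_fp32(c)
    for x, y in zip(a, b):
        prod = round_fp32(x * y)
        acc = round_fp32(acc + prod)
    return acc

def dot_fused(a, b, c=0.0):
    """Idealised fused model: exact products and sum, one final FP32 rounding."""
    exact = Fraction(c) + sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    return round_fp32(exact)

# All inputs are exactly representable in FP16.
a = [4096.0, 1.0, -4096.0]
b = [4096.0, 1.0, 4096.0]

print(dot_discrete(a, b))  # 2**24 + 1 is not an FP32 value, so the 1 is lost
print(dot_fused(a, b))     # the exact sum is 1
```

In this example the large terms cancel, so the discrete chain returns 0.0 while the single-rounding model returns 1.0. This is the kind of accumulated rounding error the abstract attributes to discrete arithmetic units.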
If this is right
- Matrix-multiply-accumulate operations inside open-source GPGPUs can complete in four cycles, rather than at the higher latency of chained discrete units.
- Resource utilization improves because a single pipeline replaces multiple separate arithmetic blocks.
- Dynamic power drops further through built-in sparse lane clock-gating when many lanes are inactive.
- Designers gain an open-source drop-in unit that already matches commercial Tensor Core accuracy for mixed-precision workloads.
- The same fused structure scales to additional low-precision formats without redesigning separate adders or multipliers.
Where Pith is reading between the lines
- Other open-source GPU projects could adopt the same fused pipeline to reduce their own Tensor Core area and latency budgets.
- Real silicon measurements on a fabricated chip rather than FPGA emulation would reveal whether clock frequency or power numbers shift under sustained AI workloads.
- The Microscaling support already present could be extended to newer formats such as FP4 or FP6 once the base unit is verified.
- Integration with higher-level compilers would let software teams automatically choose the fused unit for any matrix operation that matches the supported precisions.
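The Microscaling point in the list above can be illustrated with a small sketch of how a dot product over two MX-style blocks factors. Each block carries one shared power-of-two scale, so the element products can be summed unscaled and the two scales applied once at the end. The function name and the 4-element blocks are hypothetical and chosen for brevity. The OCP MX specification, as far as I know, uses 32-element blocks with an 8-bit exponent-only scale, and how Ten-Four applies the scale internally is not described in the source.

```python
from fractions import Fraction

def mx_block_dot(elems_a, exp_a, elems_b, exp_b):
    """Dot product of two MX-style blocks.

    elems_a, elems_b: low-precision element values of each block.
    exp_a, exp_b: shared scale exponents, so block A represents
                  elems_a[i] * 2**exp_a and likewise for block B.
    """
    if len(elems_a) != len(elems_b):
        raise ValueError("blocks must have the same number of elements")
    # Sum the unscaled products exactly, then apply both scales once.
    unscaled = sum(Fraction(x) * Fraction(y) for x, y in zip(elems_a, elems_b))
    return float(unscaled * Fraction(2) ** (exp_a + exp_b))

block_a = [1.5, -2.0, 0.5, 4.0]
block_b = [2.0, 1.0, 4.0, 0.25]
result = mx_block_dot(block_a, -2, block_b, 3)
print(result)
```

Because the scales are powers of two, applying them amounts to an exponent adjustment rather than a full multiplication, which is one reason a shared-scale format fits naturally beside a fused accumulator.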
Load-bearing premise
The fused pipeline produces exactly the same numerical results as separate discrete units for every supported format and every input pattern that arises inside the full Vortex Tensor Core.
What would settle it
A side-by-side numerical comparison of Ten-Four outputs against a reference discrete-unit implementation for thousands of random and corner-case inputs across all six multiplication formats, or a full integration test inside the Vortex Tensor Core that shows any deviation in accumulated results.
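A side-by-side comparison of this kind could be organised as in the following sketch. It is a software stand-in only: `reference_discrete` and `candidate_fused` are hypothetical models, not the HardFloat or Ten-Four implementations, and the harness covers finite FP16 inputs only. NaN and infinity handling would need separate, format-specific checks.

```python
import math
import random
import struct
from fractions import Fraction

def round_fp32(x):
    """Round to the nearest FP32 value (via a double, adequate for a sketch)."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

def fp16_from_bits(bits):
    """Decode a 16-bit pattern as an IEEE FP16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def reference_discrete(a, b):
    """Stand-in reference: round after every multiply and every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_fp32(acc + round_fp32(x * y))
    return acc

def candidate_fused(a, b):
    """Stand-in device under test: exact sum, one final rounding."""
    return round_fp32(sum(Fraction(x) * Fraction(y) for x, y in zip(a, b)))

def bit_identical(a, b):
    """Compare the two models on their FP32 bit patterns, not on float equality."""
    ref = struct.pack("<f", reference_discrete(a, b))
    got = struct.pack("<f", candidate_fused(a, b))
    return ref == got

# Finite FP16 corner cases: +0, -0, smallest and largest subnormal,
# smallest normal, largest positive and negative finite values.
CORNER_BITS = [0x0000, 0x8000, 0x0001, 0x03FF, 0x0400, 0x7BFF, 0xFBFF]

def make_vectors(count, lanes, rng):
    """Mix random bit patterns with corner cases, keeping finite values only."""
    vectors = []
    for _ in range(count):
        vec = []
        while len(vec) < lanes:
            if rng.random() < 0.25:
                bits = rng.choice(CORNER_BITS)
            else:
                bits = rng.getrandbits(16)
            value = fp16_from_bits(bits)
            if math.isfinite(value):
                vec.append(value)
        vectors.append(vec)
    return vectors

def count_mismatches(trials=2000, lanes=4, seed=0):
    """Return how many input pairs give different FP32 bit patterns."""
    rng = random.Random(seed)
    a_vecs = make_vectors(trials, lanes, rng)
    b_vecs = make_vectors(trials, lanes, rng)
    return sum(1 for a, b in zip(a_vecs, b_vecs) if not bit_identical(a, b))

print(count_mismatches(), "mismatching input pairs out of 2000")
```

In a real study the two stand-in models would be replaced by the RTL simulation output and the chosen reference, and a non-zero mismatch count would be the starting point for classifying which inputs diverge and why.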
Original abstract
Efficient mixed-precision matrix multiply accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic units, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a scalable mixed-precision fused dot product unit that integrates both the floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native support for Microscaling (MX) and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core's numerical accuracy. Ten-Four achieves 4-cycle operation latency at 262.325 MHz Fmax, delivering 134.308 GFLOPS peak throughput per Tensor Core on the AMD Xilinx Alveo U55C FPGA, demonstrating ~3.1x performance improvement over an equivalent Berkeley HardFloat-based implementation at less than 60% the area cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Ten-Four, a scalable open-source fused dot-product unit for mixed-precision MMA operations integrated into the Vortex RISC-V GPGPU Tensor Core. It fuses the FP and INT pipelines to support multiplication in FP16/BF16/FP8/BF8/INT8/INT4 with accumulation in FP32/INT32, and adds native Microscaling (MX) support and sparse lane clock-gating. It reports 4-cycle latency at 262.325 MHz Fmax on the AMD Xilinx Alveo U55C, delivering 134.308 GFLOPS per Tensor Core with ~3.1× throughput improvement at <60% of the area of a Berkeley HardFloat baseline, while claiming to match NVIDIA Tensor Core numerical accuracy.
Significance. If the reported FPGA measurements and numerical equivalence hold, the work supplies a concrete, reproducible open-source building block for low-precision tensor operations on an open GPGPU platform. The fused architecture and concrete post-synthesis numbers (frequency, latency, throughput, area) constitute a useful reference point for the community working on hardware accelerators for deep learning.
major comments (1)
- [§5] §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.
minor comments (2)
- [Table 2] Table 2 (resource utilization): clarify whether the reported LUT/FF/DSP counts include or exclude the MX scaling logic and sparse-gating circuitry.
- [Figure 4] Figure 4 (pipeline diagram): the boundary between the fused FP and INT paths is not labeled with cycle-accurate stage boundaries, making it difficult to verify the stated 4-cycle latency.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive comment on the verification aspects of our work. We address the major comment point by point below and will strengthen the manuscript accordingly.
- Referee: [§5] §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.
Authors: We agree that the current manuscript does not provide explicit test-vector suites, coverage metrics, or side-by-side tables for the corner cases in FP8/BF8/INT4. While our internal verification process included targeted test vectors for denormals, NaNs, and accumulation overflow (generated both randomly and from known edge-case patterns) and confirmed bit-identical behavior against separate Berkeley HardFloat units as well as matching NVIDIA Tensor Core results where defined, these details were omitted due to page limits. In the revised manuscript we will expand the verification subsection in §5 to include: (1) a description of the test-vector generation methodology, (2) coverage metrics for the relevant IEEE 754 and MX corner cases, and (3) concise side-by-side comparison tables for representative denormal, NaN, and overflow scenarios. This addition will make the numerical-equivalence claims fully reproducible without altering the reported results.
Revision: yes
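The test-vector methodology itself is not described in the source. As one hedged illustration of what exhaustive corner-case coverage can look like for an 8-bit format, the sketch below enumerates all 256 encodings of E5M2 and sorts them into classes. It assumes that the "BF8" format named in the paper corresponds to E5M2 (1 sign, 5 exponent, 2 mantissa bits), which the source does not confirm. The function names are hypothetical.

```python
import struct
from collections import Counter

def decode_e5m2(byte):
    """Decode an 8-bit E5M2 pattern.
    E5M2 has the same layout as the upper byte of an IEEE FP16 value,
    so padding with eight zero bits gives the exact FP16 equivalent."""
    return struct.unpack("<e", struct.pack("<H", byte << 8))[0]

def classify_e5m2(byte):
    """Classify an E5M2 pattern by its exponent and mantissa fields."""
    exponent = (byte >> 2) & 0x1F
    mantissa = byte & 0x3
    if exponent == 0x1F:
        return "nan" if mantissa else "inf"
    if exponent == 0:
        return "subnormal" if mantissa else "zero"
    return "normal"

# With only 256 encodings, every class can be covered exhaustively.
counts = Counter(classify_e5m2(b) for b in range(256))
for name in ("zero", "subnormal", "normal", "inf", "nan"):
    print(f"{name:9s} {counts[name]:3d}")

print("largest finite value:", decode_e5m2(0x7B))
print("smallest subnormal:  ", decode_e5m2(0x01))
```

Since the input space of a single 8-bit operand is this small, a coverage table for denormals and NaNs can report exact counts per class rather than sampled estimates. The combinatorial growth comes from lane count and accumulator state rather than from the element format.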
Circularity Check
No circularity: performance metrics are direct FPGA synthesis results
The paper reports an FPGA implementation of a fused dot-product unit with measured outcomes (4-cycle latency at 262.325 MHz, 134.308 GFLOPS, ~3.1x speedup, <60% area) obtained from synthesis and timing analysis on the Alveo U55C. These are empirical hardware results rather than predictions or derivations that reduce to fitted parameters or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that loop back to the inputs by construction. The numerical-accuracy claim is presented as a design goal matching NVIDIA Tensor Cores but is not used as a load-bearing derivation step within the paper itself. The contribution is therefore self-contained as an implementation artifact.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard assumptions of synchronous digital design, FPGA synthesis tools, and IEEE floating-point rounding modes hold for the target platform.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose a configurable 4-stage fused dot product architecture supporting low-precision (FP16/BF16/FP8/BF8) multiplication with FP32 accumulation... MOD-4 CSA accumulator structure."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.