arxiv: 2512.00053 · v2 · submitted 2025-11-19 · 💻 cs.AR

Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

Pith reviewed 2026-05-17 20:32 UTC · model grok-4.3

classification 💻 cs.AR
keywords fused dot product · mixed-precision arithmetic · tensor core · GPGPU · FPGA implementation · RISC-V · matrix multiply-accumulate · open-source hardware

The pith

Ten-Four fuses floating-point and integer pipelines into one dot-product unit that performs mixed-precision dot-product operations with four-cycle latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ten-Four as a scalable mixed-precision fused dot product unit built for the Tensor Core of the open-source, RISC-V-based Vortex GPGPU. It combines floating-point and integer arithmetic paths to handle multiplications in FP16, BF16, FP8, BF8, INT8, and INT4 formats while accumulating results in FP32 or INT32. The design adds native Microscaling (MX) support and sparse lane clock-gating for dynamic power reduction. On an AMD Xilinx Alveo U55C FPGA it reaches 4-cycle operation latency at 262.325 MHz Fmax, delivering a peak of 134.308 GFLOPS per Tensor Core, roughly 3.1 times the performance of an equivalent Berkeley HardFloat-based implementation at less than 60 percent of the area. The authors also report that it matches NVIDIA Tensor Core numerical accuracy. This matters for open-source GPGPU development because existing open-source Tensor Core dot-product implementations rely on discrete arithmetic units, which the authors say lead to high latency, accumulated rounding errors, and poor resource utilization.
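As a quick sanity check on the headline figures, the reported throughput and clock frequency imply a per-cycle operation count. The sketch below uses only the numbers quoted above. Reading the result as roughly 512 floating-point operations (256 multiply-accumulates) per cycle per Tensor Core is an inference from those numbers, not a configuration stated in the abstract.

```python
# Back-of-the-envelope check using only figures quoted in the abstract.
FMAX_HZ = 262.325e6      # reported Fmax
PEAK_FLOPS = 134.308e9   # reported peak throughput per Tensor Core
LATENCY_CYCLES = 4       # reported operation latency

# Implied operations per clock cycle (inference, not a stated parameter).
flops_per_cycle = PEAK_FLOPS / FMAX_HZ
# Assumption: one multiply-accumulate is counted as two FLOPs.
macs_per_cycle = flops_per_cycle / 2
# Wall-clock latency of one operation at the reported clock.
latency_ns = LATENCY_CYCLES / FMAX_HZ * 1e9

print(f"{flops_per_cycle:.1f} FLOPs/cycle")
print(f"{macs_per_cycle:.1f} MACs/cycle")
print(f"{latency_ns:.2f} ns latency")
```

The implied count lands very close to 512 FLOPs per cycle, and the four-cycle latency corresponds to about 15 ns at the reported clock.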

Core claim

Ten-Four integrates both the floating-point and integer arithmetic pipelines within a single fused architecture. It supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native Microscaling (MX) support and sparse lane clock-gating. On the AMD Xilinx Alveo U55C FPGA it achieves 4-cycle operation latency at 262.325 MHz Fmax and 134.308 GFLOPS peak throughput per Tensor Core. That is approximately 3.1 times the performance of an equivalent Berkeley HardFloat-based implementation at less than 60 percent of the area cost, while matching NVIDIA Tensor Core numerical accuracy.
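To make the Microscaling part of the claim concrete, here is a minimal software sketch of a dot product over two MX blocks, assuming the usual MX convention that every element in a block shares one power-of-two scale. The function name `mx_block_dot` and its parameters are hypothetical, and the code models only the arithmetic, not the Ten-Four datapath.

```python
from fractions import Fraction

def mx_block_dot(a_elems, a_scale_exp, b_elems, b_scale_exp):
    """Dot product of two MX blocks (illustrative model only).

    a_elems, b_elems: already-decoded element values of equal length.
    a_scale_exp, b_scale_exp: unbiased exponents of the shared
    power-of-two block scales (assumed convention).
    """
    if len(a_elems) != len(b_elems):
        raise ValueError("blocks must have the same number of elements")
    # The shared scales factor out of the sum, so the element products
    # can be accumulated first and scaled once at the end.
    acc = sum(Fraction(x) * Fraction(y) for x, y in zip(a_elems, b_elems))
    return acc * Fraction(2) ** (a_scale_exp + b_scale_exp)

a = [1.5, -2.0, 0.5, 4.0]
b = [2.0, 0.25, -1.0, 0.5]
result = mx_block_dot(a, -3, b, 1)
print(float(result))
```

Because the scales are powers of two, applying them amounts to a single exponent adjustment on the accumulated sum, which is one reason a fused unit can absorb MX scaling without a separate multiplier.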

What carries the argument

A single fused dot-product architecture that merges floating-point and integer pipelines to perform multiplication and accumulation without intermediate rounding or separate units.
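The effect of skipping intermediate rounding can be illustrated in software. The sketch below is a model of the principle only, not of Ten-Four's hardware: `discrete_dot` rounds to FP32 after every multiply and add, while `fused_dot` sums the exact products and rounds once. All function names are illustrative, and the inputs are chosen to be exactly representable in FP16.

```python
import struct
from fractions import Fraction

def to_fp32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def is_fp16(x):
    """True if x is exactly representable in IEEE binary16."""
    return struct.unpack("<e", struct.pack("<e", x))[0] == x

def discrete_dot(a, b):
    """Separate multiplier and adder: round to FP32 after every operation."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp32(x * y)
        acc = to_fp32(acc + prod)
    return acc

def fused_dot(a, b):
    """Fused model: sum the exact products, then round to FP32 once."""
    exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    # float() followed by to_fp32() rounds twice in general; for these
    # inputs the exact sum is representable in FP32, so nothing is lost.
    return to_fp32(float(exact))

a = [1.0, 2.0 ** -12, 2.0 ** -12]
b = [1.0, 2.0 ** -12, 2.0 ** -12]
assert all(is_fp16(v) for v in a + b)

print(discrete_dot(a, b))
print(fused_dot(a, b))
```

With these inputs each small product equals 2^-24, which is exactly half an FP32 unit in the last place at 1.0. The discrete model rounds each addition back to 1.0, whereas the single-rounding model returns 1 + 2^-23.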

If this is right

  • The dot-product operations underlying matrix multiply-accumulate in open-source GPGPUs can complete with four-cycle latency rather than the higher latency of discrete units.
  • Resource utilization improves because a single pipeline replaces multiple separate arithmetic blocks.
  • Dynamic power drops further through built-in sparse lane clock-gating when many lanes are inactive.
  • Designers gain an open-source drop-in unit that already matches commercial Tensor Core accuracy for mixed-precision workloads.
  • The same fused structure scales to additional low-precision formats without redesigning separate adders or multipliers.
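The clock-gating point above can be sketched with a simple software model. The summary says only that inactive lanes are clock-gated. The specific gating condition used here (a lane is idle when either operand is zero) is an assumption made for illustration, and `lane_enables` and `gated_fraction` are hypothetical names.

```python
def lane_enables(a, b):
    """One enable flag per lane.

    Assumed policy: a lane is enabled only when both operands are
    nonzero, since a zero operand contributes nothing to the sum.
    """
    return [x != 0 and y != 0 for x, y in zip(a, b)]

def gated_fraction(a, b):
    """Fraction of lanes that could be gated under the assumed policy."""
    enables = lane_enables(a, b)
    return 1 - sum(enables) / len(enables)

def gated_dot(a, b):
    """Dot product that only evaluates enabled lanes."""
    return sum(x * y for x, y, en in zip(a, b, lane_enables(a, b)) if en)

a = [3, 0, -2, 0, 5, 0, 0, 1]
b = [1, 4, 0, 0, 2, 7, 0, -3]
print(gated_fraction(a, b))
print(gated_dot(a, b) == sum(x * y for x, y in zip(a, b)))
```

Under this policy the result is unchanged, because every skipped lane would have contributed zero. The potential power saving grows with the fraction of gated lanes.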

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other open-source GPU projects could adopt the same fused pipeline to reduce their own Tensor Core area and latency budgets.
  • Real silicon measurements on a fabricated chip rather than FPGA emulation would reveal whether clock frequency or power numbers shift under sustained AI workloads.
  • The Microscaling support already present could be extended to newer formats such as FP4 or FP6 once the base unit is verified.
  • Integration with higher-level compilers would let software teams automatically choose the fused unit for any matrix operation that matches the supported precisions.

Load-bearing premise

The fused pipeline produces exactly the same numerical results as separate discrete units for every supported format and every input pattern that arises inside the full Vortex Tensor Core.

What would settle it

A side-by-side numerical comparison of Ten-Four outputs against a reference discrete-unit implementation for thousands of random and corner-case inputs across all six multiplication formats, or a full integration test inside the Vortex Tensor Core that shows any deviation in accumulated results.
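One way such a comparison could start is by enumerating every encoding of an 8-bit format and grouping the values by class, so that corner cases are covered exhaustively rather than sampled. The sketch below assumes that BF8 denotes the E5M2 layout (1 sign bit, 5 exponent bits with bias 15, 2 mantissa bits, IEEE-style infinities and NaNs). The paper's exact encoding is not stated here, and `decode_e5m2` is an illustrative helper.

```python
from collections import Counter

def decode_e5m2(byte):
    """Decode one E5M2 byte into (value, class). Layout is an assumption."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 2) & 0x1F
    man = byte & 0x3
    if exp == 0x1F:
        if man == 0:
            return sign * float("inf"), "inf"
        return float("nan"), "nan"
    if exp == 0:
        if man == 0:
            return sign * 0.0, "zero"
        # Subnormal: no implicit leading one, exponent fixed at 1 - bias.
        return sign * man * 2.0 ** -16, "subnormal"
    return sign * (1 + man / 4) * 2.0 ** (exp - 15), "normal"

# Group all 256 encodings by class to build corner-case test vectors.
by_class = {}
for byte in range(256):
    value, kind = decode_e5m2(byte)
    by_class.setdefault(kind, []).append(byte)

counts = Counter({kind: len(codes) for kind, codes in by_class.items()})
print(dict(counts))
```

Pairs drawn from the `subnormal`, `inf`, and `nan` groups would then feed both the unit under test and the reference model, with random vectors drawn from the `normal` group filling in the rest.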

Figures

Figures reproduced from arXiv: 2512.00053 by Blaise Tine, Nikhil Rout.

Figure 1: GPGPU Mixed-Precision Fused Dot Product Unit 4-Stage Pipeline Microarchitecture
Figure 2: FEDP Backends Performance Scaling (FP16/BF16)

Original abstract

Efficient mixed-precision matrix multiply accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic units, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a scalable mixed-precision fused dot product unit that integrates both the floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native support for Microscaling (MX) and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core's numerical accuracy. Ten-Four achieves 4-cycle operation latency at 262.325 MHz Fmax, delivering 134.308 GFLOPS peak throughput per Tensor Core on the AMD Xilinx Alveo U55C FPGA, demonstrating ~3.1x performance improvement over an equivalent Berkeley HardFloat-based implementation at less than 60% the area cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents Ten-Four, a scalable open-source fused dot-product unit for mixed-precision MMA operations integrated into the Vortex RISC-V GPGPU Tensor Core. It fuses FP and INT pipelines to support multiplication in FP16/BF16/FP8/BF8/INT8/INT4 with accumulation in FP32/INT32, and adds native MX microscaling and sparse lane clock-gating. It reports 4-cycle latency at 262.325 MHz Fmax on the AMD Xilinx Alveo U55C, delivering 134.308 GFLOPS per Tensor Core with ~3.1× performance improvement at <60 % of the area relative to a Berkeley HardFloat baseline, while claiming numerical accuracy that matches NVIDIA Tensor Cores.

Significance. If the reported FPGA measurements and numerical equivalence hold, the work supplies a concrete, reproducible open-source building block for low-precision tensor operations on an open GPGPU platform. The fused architecture and concrete post-synthesis numbers (frequency, latency, throughput, area) constitute a useful reference point for the community working on hardware accelerators for deep learning.

major comments (1)
  1. [§5] §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.
minor comments (2)
  1. [Table 2] Table 2 (resource utilization): clarify whether the reported LUT/FF/DSP counts include or exclude the MX scaling logic and sparse-gating circuitry.
  2. [Figure 4] Figure 4 (pipeline diagram): the boundary between the fused FP and INT paths is not labeled with cycle-accurate stage boundaries, making it difficult to verify the stated 4-cycle latency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment on the verification aspects of our work. We address the major comment point by point below and will strengthen the manuscript accordingly.

  1. Referee: [§5] §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.

    Authors: We agree that the current manuscript does not provide explicit test-vector suites, coverage metrics, or side-by-side tables for the corner cases in FP8/BF8/INT4. While our internal verification process included targeted test vectors for denormals, NaNs, and accumulation overflow (generated both randomly and from known edge-case patterns) and confirmed bit-identical behavior against separate Berkeley HardFloat units as well as matching NVIDIA Tensor Core results where defined, these details were omitted due to page limits. In the revised manuscript we will expand the verification subsection in §5 to include: (1) a description of the test-vector generation methodology, (2) coverage metrics for the relevant IEEE 754 and MX corner cases, and (3) concise side-by-side comparison tables for representative denormal, NaN, and overflow scenarios. This addition will make the numerical-equivalence claims fully reproducible without altering the reported results. revision: yes

Circularity Check

0 steps flagged

No circularity: performance metrics are direct FPGA synthesis results

The paper reports an FPGA implementation of a fused dot-product unit with measured outcomes (4-cycle latency at 262.325 MHz, 134.308 GFLOPS, ~3.1x speedup, <60% area) obtained from synthesis and timing analysis on the Alveo U55C. These are empirical hardware results rather than predictions or derivations that reduce to fitted parameters or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that loop back to the inputs by construction. The numerical-accuracy claim is presented as a design goal matching NVIDIA Tensor Cores but is not used as a load-bearing derivation step within the paper itself. The contribution is therefore self-contained as an implementation artifact.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard digital design assumptions and re-uses existing open-source arithmetic libraries rather than introducing new mathematical axioms or fitted constants.

axioms (1)
  • [standard math] Standard assumptions of synchronous digital design, FPGA synthesis tools, and IEEE floating-point rounding modes hold for the target platform.
    Invoked implicitly when claiming 4-cycle latency and numerical accuracy matching NVIDIA Tensor Cores.
